Final Report (ESE DARPA Program) - Summary:


Summary:  In  this  DARPA  program,  we  have  developed  a  robust  design 
methodology  to  scale  power  supply  voltages  to  levels  as  low  as  250mV,  reducing 
the  energy  dissipation  of  digital  computation  by  an  order  of  magnitude.  We  have 
demonstrated  both  logic  (standard  cells)  and  memory.  We  have  explored  the  use  of 
parallelism  to  maintain  performance  at  reduced  power  supply  voltages.  This 
concept  was  demonstrated  with  a  UWB  baseband  processor.  We  have  developed  a 
DC-DC  converter  to  efficiently  deliver  sub-threshold  voltage  and  minimize  the 
power  dissipation  of  an  arbitrary  digital  circuit.  We  have  demonstrated  9  test 
chips in state-of-the-art 65nm, 90nm and 0.18µm CMOS technologies. All test
chips  were  fabricated  for  free  (primarily  by  TI). 

The  key  outputs  of  this  program  are: 

■  Sub-threshold  digital  library  was  developed  (for  ASICs)  in  TI’s  65-nm 
CMOS 

□  62  cells,  Vdd  =  250  mV,  (0  -  70°C),  process  variation  tolerant 

□  Integrated  into  commercial  design  flow  and  FIR  filter  test  chip  was 
demonstrated 

■  Developed  Sub-VT  10T  and  8T  SRAM  in  65-nm  CMOS 

□  Sense  amplifier  redundancy  (8T)  was  employed  to  operate  with 
supply  of  350mV  and  data  retention  <  300mV 

■  Demonstrated  a  UWB  radio  baseband  processor  at  Vdd  =  400  mV 

□ 100Mb/s @ 2.0 mW (20pJ/bit), 90-nm CMOS process

□  Parallelism  employed  (20X)  :  25  MHz,  620  correlators,  4  matched 
filters 

■  Demonstrated  a  UWB  Analog-to-Digital  Converter 

□  500Msamples/sec,  65-nm  CMOS  process 

□  Parallelism  of  36  converters  enables  sub-threshold  biasing 

■  Minimum  energy  tracking  loop  with  embedded  dc-dc  converter 

□  DC-DC  converter  delivers  voltages  down  to  250  mV  in  65nm 
CMOS 

□  50-100%  energy  savings  by  tracking  &  adjusting  minimum  energy 
point  of  operation 

■  Demonstrated  a  fully  integrated  switched  sub-VT  DC-DC  converter 

Publications: 

The  following  publications  directly  resulted  from  the  DARPA  ESE  funding: 

2006 ISSCC

1.  B.  Calhoun,  A.  Chandrakasan  “A  256kb  Sub-threshold  SRAM  in  65-nm  CMOS” 

2006  ICASSP 

2. V. Sze, R. Blazquez, M. Bhardwaj, A. Chandrakasan, “An energy efficient
sub-threshold baseband processor architecture for pulsed ultra-wideband
communications”

2006  IEEE  Symposium  on  VLSI  Circuits 


3.  B.  Ginsburg,  A.  Chandrakasan,  "A  500MS/s  5b  ADC  in  65nm  CMOS” 

2006  International  Symposium  on  Low  Power  Electronics  and  Design 

4.  J.  Kwong,  A.  Chandrakasan,  "Variation-Driven  Device  Sizing  for  Minimum 
Energy  Sub-threshold  Circuits” 

5.  B.  H.  Calhoun,  A.  Wang,  N.  Verma,  A.  P.  Chandrakasan,  "Sub-threshold  Design: 
The  Challenges  of  Minimizing  Circuit  Energy”  (invited) 

6.  B.  P.  Ginsburg  and  A.  P.  Chandrakasan,  "500-MS/s  5-b  ADC  in  65-nm  CMOS 
With  Split  Capacitor  Array” 

2007  ISSCC 

7.  Y.  Ramadass,  A.  Chandrakasan,  “Minimum  energy  tracking  loop  with  embedded 
dc-dc  converter  delivering  voltages  down  to  250  mV  in  65-nm  CMOS” 

2007 Beatrice Winner Editorial Award

8.  N.  Verma,  A.  Chandrakasan,  “A  65-nm  sub-Vt  SRAM  employing  sense-amplifier 
redundancy” 

9.  Vivienne  Sze,  Anantha  P.  Chandrakasan,  “Design  of  an  Ultra-Low  Voltage 
UWB  Baseband  Processor” 

GOMACTech (Government Microcircuit Applications & Critical Technology
Conference) 2007

10. B. Ginsburg, V. Sze, A.P. Chandrakasan, "A Parallel Energy Efficient 100Mbps
Ultra-Wideband Radio Baseband," March 2007.

11. A. Wang, B. H. Calhoun, N. Verma, J. Kwong, A. Chandrakasan, "Ultra-Dynamic
Voltage Scaling for Energy Starved Electronics," (poster presentation)

PESC  2007  (June  2007) 

12.  Yogesh  K.  Ramadass,  Anantha  P.  Chandrakasan,  “Voltage  Scalable  Switched 
Capacitor  DC-DC  Converter  for  Ultra-Low-Power  On-chip  Applications” 

ISLPED  2007 

13.  V.  Sze,  A.  Chandrakasan,  "A  0.4V  UWB  Baseband  Processor"  to  be  presented 

Journal  Papers: 

14.  Benton  H.  Calhoun,  Anantha  P.  Chandrakasan,  "Static  Noise  Margin  Variation 
for  Sub-threshold  SRAM  in  65nm  CMOS,"  IEEE  Journal  of  Solid-State  Circuits, 
vol.  41,  no.  7,  pp.  1673-1679,  July  2006. 

15. Benton H. Calhoun, Anantha P. Chandrakasan, "A 256kb 65nm Sub-threshold
SRAM Design for Ultra-Low Voltage Operation," IEEE Journal of Solid-State
Circuits, pp. 680-688, March 2007.

16. B. P. Ginsburg and A. P. Chandrakasan, "500-MS/s 5-bit ADC in 65-nm CMOS
With Split Capacitor Array DAC," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp.
739-747, Apr. 2007.

17. N. Verma, A. P. Chandrakasan, "A 256kb 65nm 8T Sub-Threshold SRAM
employing  Sense-Amplifier  Redundancy",  accepted  to  the  IEEE  J.  Solid-State 
Circuits. 
Agenda

■ 8:30-9:00 Breakfast

■ 9:00AM-9:30AM Program Overview - Prof. Anantha Chandrakasan

■ 9:30AM-10:15AM Sub-VT SRAM (10-T and 8-T) - Naveen Verma

■ 10:15AM-10:30AM Break

■ 10:30AM-11:00AM Sub-VT library design and test chip results -
Joyce Kwong

■ 11:00AM-11:30AM Sub-VT Switching Converter Design - Yogesh
Ramadass

■ 11:30AM-12:00PM A Parallel Energy Efficient 100 Mbps Ultra-
Wideband Radio Baseband - Brian Ginsburg and Vivienne Sze

■ 12:00-1:00 Lunch and Discussion

■ 1:00PM-1:30PM Ultra-low-power UWB (FCRP work) - David
Wentzloff

■ 1:30PM-2:00PM Discussion
Energy Starved Electronics (ESE)

Technical Approach

Develop and characterize ULP devices for
optimum performance in sub-threshold operation
Implement strategies to minimize reliability and
performance degradation for ULP circuits
Explore methods to increase computational
throughput with massive parallelism

Goal

Scale operating voltage to < 300 mV to reduce
power consumption (Ultra Low Power operation)
of conventional signal processor electronics by >
10X while maintaining comparable throughput

Technical Challenges

Performance degradation at reduced voltage
Increased circuit variability and error rate for low
voltage operation
Reduce leakage current of deep sub-micron
devices to reduce power consumption

Deliverables

Device technology capable of ULP operation
Circuits able to operate reliably at < 340 mV
Design techniques to provide processing
throughput comparable to conventional
electronics operating at standard voltage

Military Impact

Extended operation wireless sensor networks
Lower power, man portable comm. systems
Next Generation Low-Power Systems:
Extreme Sub-threshold Operation

[Figure: device characteristic versus gate voltage (V), marking the weak inversion region (slower, low power) and the strong inversion region (fast, high power).]

■ Goal: Enable ultra-low power digital
circuits operating in the sub-threshold
regime while maintaining
adequate performance. Power
consumption savings > 10X.

■ Technical Challenges

□ Develop cell library and SRAM
operating at Vdd < 300 mV

□ Achieving adequate performance

□ Addressing high sensitivity to
variations

■ Impact:

□ Dramatic increase in battery
lifetime of portable devices
(sensors, wireless
communications, signal
processing, etc.)
Key Technical Challenges

■ SRAM is a critical component in current digital systems

□ Scaling SRAM to 0.3V requires new circuits and architectures

□ Leakage power is critical in low-duty-cycle applications

■ Device variability in logic and SRAM circuits

□ Design modeling and architecture to deal with an order of
magnitude increase in variability

■ DC-DC converter design that

□ delivers microamp currents at high efficiency (> 80%)

□ minimizes energy dissipation of digital circuit

■ High performance in sub-threshold operation

□ Use of parallelism to mitigate performance loss

Key Accomplishments

■ Sub-threshold digital library developed (for ASICs) in 65-nm CMOS

□ 62 cells, Vdd < 200 mV, (0 - 70°C), process variation tolerant

□ Integrated into commercial design flow

□ FIR filter (demonstrated)

■ Sub-VT 10T and 8T SRAM in 65-nm CMOS

□ Sense amplifier redundancy (8T)

□ VDD = 350 mV with data retention < 300mV

■ UWB radio baseband processor at Vdd = 400 mV

□ 100Mb/s @ 2 mW (20pJ/bit), 90-nm CMOS process

□ Parallelism employed (20X): 25 MHz, 4 matched filters

■ UWB Analog-to-Digital Converter

□ 500Msamples/sec, 65-nm CMOS process

□ Parallelism of 36 converters enables sub-threshold biasing

■ Minimum energy tracking loop with embedded dc-dc converter

□ DC-DC converter delivers voltages down to 250 mV in 65nm CMOS

□ 50-100% energy savings by tracking & adjusting minimum energy point of operation

■ Preliminary demonstration of fully integrated sub-VT DC-DC converter


Sub-Threshold ICs Under DARPA ESE

[Figure: die photos of the test chips, labeled by publication venue: ISSCC05 (pre-ESE seedling), ISSCC06, ICASSP06 and ISSCC07, ISLPED06, ISSCC07 (256kb 8-T SRAM with redundancy, 65nm), VLSI Symposium 06 (500MS/s ADC using 36 parallel channels, 65nm), ISSCC07, and two ISSCC07 FCRP chips.]
Cell Library Design

■ Sizing transistors in each cell considering
energy and variation

■ Functional and mismatch simulations at
<300mV, worst-case corner

■ Verifying robustness through Monte-Carlo
simulation

■ Cell layout and characterization

List of Standard Cells

INV, NAND2, NOR2, NAND2B, NOR2B, AND2, OR2, MUX2, MUXE2,
AOIB21, OAIB21, XOR2, XNOR2

ADDF, ADDH, NAND3, NOR3, DFF, DFF sync reset, DFF async reset,
DFF sync preset, Latch
300mV Digital Library Developed

Goals: Mitigate variation; enable deep voltage scaling

[Figure: die photo of the sub-VT library test chip, 65nm CMOS.]

■ Demonstrated a library that
operates <300mV

■ Includes 62 standard cells
Problems with 6-T SRAM in Sub-VT

Problem 1

Feedback too strong:
Cannot write new data!

Problem 2

Bit line leakage impacts read value:
Cannot read correctly!

Problem 3

Static Noise Margin (SNM) degraded by variation:
Cannot hold data during read!

Lowest previously demonstrated SRAM in 65nm is 0.7V

Sub-threshold SRAM

[Figure: die photo of the 256kb 8-T SRAM with redundancy, 65nm CMOS.]

Lowest Operating Voltage SRAM (<350mV) in 65-nm CMOS Demonstrated
Motivation for Energy
Minimizing Loop

[Figure: energy per operation versus supply voltage (0.2 V to 1.2 V) for a 7-tap FIR filter, showing the active (Eact), leakage (Eleak) and total (Etot) components; MEP ~400mV.]

6X savings in energy obtained
by operating at the MEP
compared to the nominal voltage
of 1.2V

MEP moves with change in
workload - no. of taps of the FIR

A further 2.2X improvement in
energy consumed can be
obtained by tracking the MEP
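The existence of a minimum energy point (MEP) can be illustrated with a toy model. The sketch below is not the measured FIR data; it assumes a purely sub-threshold current law, so that active energy scales as VDD^2 and leakage energy per operation scales as VDD^2 * exp(-VDD / (n * kT/q)). The names `energy_per_op`, `find_mep`, and the parameter `leak_ratio` (relative weight of leakage, standing in for workload) are hypothetical.

```python
import math

THERMAL_VOLTAGE = 0.02585  # kT/q near 300 K, in volts


def energy_per_op(vdd, leak_ratio=1000.0, n=1.5):
    """Relative energy per operation in a toy sub-threshold model.

    active  ~ VDD^2
    leakage ~ leak_ratio * VDD^2 * exp(-VDD / (n * kT/q)),
    because the cycle time grows exponentially as VDD drops.
    leak_ratio and n are illustrative assumptions, not fitted values.
    """
    slope = n * THERMAL_VOLTAGE
    active = vdd ** 2
    leakage = leak_ratio * vdd ** 2 * math.exp(-vdd / slope)
    return active + leakage


def find_mep(leak_ratio=1000.0, v_lo=0.10, v_hi=1.20, steps=1101):
    """Grid-search the supply voltage that minimizes energy per operation."""
    best_v = v_lo
    best_e = energy_per_op(v_lo, leak_ratio)
    for i in range(steps):
        v = v_lo + (v_hi - v_lo) * i / (steps - 1)
        e = energy_per_op(v, leak_ratio)
        if e < best_e:
            best_v, best_e = v, e
    return best_v, best_e


if __name__ == "__main__":
    for ratio in (1000.0, 10000.0):
        v, e = find_mep(ratio)
        saving = energy_per_op(1.2, ratio) / e
        print(f"leak_ratio={ratio:g}: MEP near {v:.3f} V, {saving:.1f}x below 1.2 V")
```

In this model a larger leakage share pushes the MEP to a higher voltage, which is the qualitative reason a fixed supply cannot stay optimal when the workload changes.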
Minimum Energy Tracking Loop with
Embedded DC-DC Converter

■ Feedback circuit “minimizes” energy of digital
logic in 65-nm CMOS

■ ISSCC 2007 Beatrice Winner Editorial Award
(February 2007, San Francisco)
Application: UWB Impulse Signaling

Real-time FPGA
Based Environment

■ Example system shown uses off-the-shelf components (FPGAs)
and demonstrates 100Mb/s UWB link using MBOA and pulses

Platform for Channel Accurate Algorithm Testing
(Funding through NSF, ARL, and HP)
Highly-Parallel UWB ADC

■ Maximally parallel ADC for
UWB applications

□ 36 parallel channels

□ 500mV channel operation at
500MSample/s

□ Fully sub-threshold analog
biasing

■ Optimum mixed-signal energy model

✓ 3x energy savings for
low-voltage operation

✓ 6 redundant channels counteract
yield loss from local variation in
deep-submicron CMOS

[Figure: plot versus M (# interleaved channels).]
UWB Radio Baseband Processor at
Vdd = 400 mV Using Extreme Parallelism

High Throughput: 100 Mbps throughput @ 2 mW (20pJ/bit)
for 4-kbit packet

Reduced Operating Frequency: 25 MHz through parallelism -
620 correlators, 4 matched filters
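As a consistency check on the figures quoted above (a worked calculation, not additional measured data), energy per bit is power divided by throughput, and the bits processed per clock cycle follow from the throughput and the clock rate:

```latex
E_{\text{bit}} = \frac{P}{R}
  = \frac{2\times10^{-3}\,\text{J/s}}{1\times10^{8}\,\text{bit/s}}
  = 2\times10^{-11}\,\text{J/bit} = 20\,\text{pJ/bit},
\qquad
\frac{R}{f_{\text{clk}}} = \frac{100\,\text{Mb/s}}{25\,\text{MHz}} = 4\,\text{bits per cycle}
```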
Research Visibility

EE Times

“A paper from MIT may introduce a whole new metric: lowest operating voltage. By
aggressive use of voltage-frequency scaling, subthreshold circuit operation, and
supply voltage dithering, the team was able to keep an adder circuit operating over
the full range from 1.1 V to under 300 mV. This appears to be the lowest reported
operating voltage for a digital circuit at the conference.”

THE WALL STREET JOURNAL.

February 7, 2006

By DON CLARK and CHARLES FORELLE
Intel, TI Chips Use Less Power

Texas Instruments, as part of a project led by researchers at the Massachusetts
Institute of Technology and funded by the Pentagon's Defense Advanced Research
Projects Agency, is using the same generation of production technology to create
a test memory chip that sets a record for low voltage in such devices. Yet the 0.4-
volt chip is much better at controlling unwanted leakage of electrical current than
existing 0.6-volt chips, the company says. Such leakage is a big contributor to
power consumption.
Electronics Weekly

Ultra-low power CMOS design explained

Friday 16 February 2007

Researchers at the Massachusetts Institute of Technology (MIT) have developed a feedback-control
scheme that interactively tunes CMOS operating voltage to minimise dissipation.

Energy consumption in CMOS drops quadratically as its supply voltage is brought below its
threshold voltage. However, according to MIT, leakage increases exponentially at the same time.

This means that for any given circuit workload and temperature, there is a particular supply voltage
that trades capacitive losses with leakage in a way that minimises power consumption.

The example CMOS 'load' in the 65nm MIT circuit, fabricated by TI, is a hardware 7-tap FIR
filter, whose power supply comes from an on-chip DC-DC converter capable of delivering 250
to 700mV at 1-100µW at over 80 per cent efficiency.

The loop consists of an energy sensor and a controller that moves the supply voltage slightly
- via the DC-DC converter - to see what effect it has on energy consumption. In this way the
controller can push the supply voltage in the improving-energy direction until it settles at the
bottom of the power dip.

Changing the 7-tap filter (at optimal voltage) to a 1-tap version drops power by 25 per cent at
constant voltage, whereas feedback control achieves a cut of over 40 per cent.

In the presence of leakage - added as a 1µA constant load to the circuit - power would almost
triple, but the loop pulls this down to an increase of only 30 per cent.

With temperature increasing from 0 to 35°C, the loop saves around 50 per cent of power
compared with constant voltage operation, claimed MIT.

The technique places no burden on the controlled 'load' and consumes a tiny fraction of the
power it saves.
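The loop behaviour described in the article (nudge the supply, observe the energy, keep moving in the improving direction) is a perturb-and-observe search. The following is a minimal software sketch of that idea, not the circuit's actual controller: `track_minimum_energy`, `measure_energy`, the step size, and the voltage limits are all hypothetical stand-ins for the energy sensor and the DC-DC converter setting.

```python
def track_minimum_energy(measure_energy, v_start, v_step=0.01,
                         v_min=0.25, v_max=0.70, max_iters=200):
    """Perturb-and-observe search for the minimum-energy supply voltage.

    measure_energy(v) stands in for the on-chip energy sensor; v_step and
    the [v_min, v_max] window are illustrative defaults, not chip values.
    """
    v = v_start
    e = measure_energy(v)
    direction = -1          # start by trying a lower voltage
    reversals = 0
    for _ in range(max_iters):
        v_next = min(v_max, max(v_min, v + direction * v_step))
        e_next = measure_energy(v_next)
        if e_next < e:
            # The step reduced energy: accept it and keep going this way.
            v, e = v_next, e_next
            reversals = 0
        else:
            # No improvement: try the other direction.
            direction = -direction
            reversals += 1
            if reversals >= 2:
                break       # neither neighbour is better, so settle here
    return v


if __name__ == "__main__":
    # Hypothetical energy curve with its minimum at 0.40 V.
    def bowl(v):
        return (v - 0.40) ** 2 + 1.0

    print(round(track_minimum_energy(bowl, v_start=0.70), 2))
```

Because the search only compares neighbouring operating points, it needs no model of the load, which matches the article's remark that the technique places no burden on the controlled circuit.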
Recent Conference Publications

■ 2006 ISSCC

□ "A 256kb Sub-threshold SRAM in 65-nm CMOS"
B. Calhoun, A. Chandrakasan, MIT

■ 2006 ICASSP

□ "An energy efficient sub-threshold baseband processor architecture for pulsed
ultra-wideband communications"
V. Sze, R. Blazquez, M. Bhardwaj, A. Chandrakasan, MIT

■ 2006 IEEE Symposium on VLSI Circuits

□ "A 500MS/s 5b ADC in 65nm CMOS"
B. Ginsburg, A. Chandrakasan

■ 2006 International Symposium on Low Power Electronics and Design

□ "Variation-Driven Device Sizing for Minimum Energy Sub-threshold
Circuits"
J. Kwong, A. Chandrakasan

□ "Sub-threshold Design: The Challenges of Minimizing Circuit Energy"
(invited)
B. H. Calhoun, A. Wang, N. Verma, A. P. Chandrakasan

□ 2006 ISLPED Low Power Design Co - "500-MS/s 5-b ADC in
65-nm CMOS With Split Capacitor Array"
B. P. Ginsburg and A. P. Chandrakasan
Recent Conference (cont.)

■ 2007 ISSCC on Ultra-low-Power Electronics (4 papers and 1 Design Contest
Award)

□ "Minimum energy tracking loop with embedded dc-dc converter delivering voltages down
to 250 mV in 65-nm CMOS"
Y. Ramadass, A. Chandrakasan, MIT (ESE Program)

□ "A 47 pJ/pulse 3.1-5GHz all-digital UWB transmitter in 90-nm CMOS"
D. Wentzloff, A. Chandrakasan, MIT (FCRP C2S2 Program)

□ "A 65-nm sub-Vt SRAM employing sense-amplifier redundancy"
N. Verma, A. Chandrakasan, MIT (ESE Program)

□ "A 2.5nJ/b 0.65V 3-5 GHz Subbanded UWB Receiver in 90-nm CMOS"
F. Lee, A. Chandrakasan, MIT (FCRP C2S2 Program)

□ "Design of an Ultra-Low Voltage UWB Baseband Processor"
Vivienne Sze, Anantha P. Chandrakasan

■ GOMAC 2007

□ "A Parallel Energy Efficient 100Mbps Ultra-Wideband Radio Baseband,"
B. P. Ginsburg, V. Sze, A. P. Chandrakasan, Government Microcircuit Applications &
Critical Technology Conference (GOMACTech), March 2007

□ "Ultra-Dynamic Voltage Scaling for Energy-Starved Electronics"
A. Wang, N. Verma, J. Kwong, A. Chandrakasan (poster presentation)

■ PESC 2007 (June 2007)

□ "Voltage Scalable Switched Capacitor DC-DC Converter for Ultra-Low-Power
On-chip Applications"
Yogesh K. Ramadass, Anantha P. Chandrakasan
Journal Papers

■ Benton H. Calhoun, Anantha P. Chandrakasan, "Static Noise
Margin Variation for Sub-threshold SRAM in 65nm CMOS," IEEE
Journal of Solid-State Circuits, vol. 41, no. 7, pp. 1673-1679, July
2006.

■ Benton H. Calhoun, Anantha P. Chandrakasan, "A 256kb 65nm
Sub-threshold SRAM Design for Ultra-Low Voltage Operation,"
IEEE Journal of Solid-State Circuits, pp. 680-688, March 2007.

■ B. P. Ginsburg and A. P. Chandrakasan, "500-MS/s 5-bit ADC in
65-nm CMOS With Split Capacitor Array DAC," IEEE J. Solid-State
Circuits, vol. 42, no. 4, pp. 739-747, Apr. 2007.

■ Invited (to the special issue of the JSSC):

□ N. Verma, A. P. Chandrakasan, "A 256kb 65nm 8T Sub-Threshold SRAM
employing Sense-Amplifier Redundancy"

□ Y. Ramadass, A. P. Chandrakasan, "A minimum energy tracking loop with
embedded DC-DC Converter enabling ultra-low-voltage operation in
65nm CMOS"
Book on Sub-threshold Circuits

SUB-THRESHOLD
DESIGN FOR ULTRA
LOW-POWER
SYSTEMS

■ A direct output of research from
the DARPA ESE and Seedling efforts

■ Also includes invited chapters from
Eric Vittoz (pioneer of sub-threshold
analog circuits) and Christian Enz
(EKV model)
Impact and Technology Transitions

■ Boeing exploring asynchronous sub-threshold logic
□ Transferred our Verilog design and sizing methodology
(Joyce Kwong and Vivienne Sze)

■ ISSCC and ISLPED have many submissions on
sub-threshold logic design

■ Implementation of sub-threshold MSP-430 (in
collaboration with TI) on-going with standard cell
library, sub-VT SRAM and DC-DC converters
Schedule and Milestones

[Figure: program timeline with milestones at 9/1/05 (program start), 12/1/05, 3/1/06, 6/1/06, 9/1/06 and 11/30/06.]

• New result: preliminary design of fully on-chip DC-DC converter
Dense Sub-Vt SRAM For Ultra-Low Leakage
Power and Access Energy

PI: Anantha Chandrakasan

Students: Naveen Verma, Benton Highsmith Calhoun

TI Collaborators: Dennis Buss, Terence Breedijk, Uming Ko, David
Scott, Dr. Alice Wang
Energy Minimization

Minimum energy VDD for logic results from opposing
active and leakage components

$E_{act} \propto C V_{DD}^2$

$E_{leak} \propto \int I_{leak} V_{DD} \, dt$

[Figure: simulation of CLA adder, energy versus VDD (V), with the minimum-energy point marked.]

• SRAMs remain “on” to retain data: minimum energy
VDD is lowest functional VDD

- DIBL reduces ILEAK by 5x from 1V to 300mV

Voltage scaling gives power savings of >15x
Outline

■ Low-voltage SRAM challenges

■ 10T sub-Vt SRAM

□ Bit-cell & peripheral assists

□ Test-chip

■ 8T sub-Vt SRAM

□ Bit-cell & peripheral assists

□ Sense-amplifier redundancy

□ Test-chip

■ Conclusions
Sub-Vt MOSFET Characteristic

In sub-Vt:

1) Device strength varies exponentially with Vt

2) ION/IOFF is severely degraded

SNM during Hold and Read

Read SNM is worst-case

Write Failures

Prior to write

Write failure:
Positive write margin

Successful write:
Negative write margin

6T Low Voltage Failures

Relative device strengths determine
readability/writeability

Read Current Distribution

Array performance determined by worst-case IREAD

In sub-Vt, reduced overdrive lowers mean IREAD,
and variation causes larger degradation of tail IREAD

10T Bit-Cell Reduces Bitline Leakage

[Figure: 10T bit-cell read buffer with QB=1 and RBL=1; node QBB is held near 1 by leakage, and bitline leakage is reduced by the stack.]
10T Bit-Cell Lowers Bitline Leakage

[Figure: steady-state BL read values.]

10T bitcell enables higher level of integration on BL

10T Bit-Cell Allows Sub-VT Write

Floating VDD weakens feedback and allows write

[Figure: write waveforms of WL, VVDD, Q and QB; VVDD floats during the write, then feedback restores '1' to VDD.]
Test Chip Architecture

• 256 rows and 128 columns per block

• Static CMOS peripherals

• Separate WL VDD for boosting

• Assumed 1x1 redundancy

• Simulation:

[Figure: block diagram with address inputs, global word lines (WLglobal) and block select (BKsel).]

256Kb 65nm Sub-VT memory

Test chip addressing the sub-VT problems using 10T bitcell:
1.89mm by 1.12mm

Chip functions to below 400mV, holds without error to <250mV:
at 400mV, 3.28µW and 475kHz at 27°C

Reads without error to 320mV (27°C) and 360mV (85°C)

Writes without error to 380mV (27°C) and 350mV (85°C)


Power  Measurements 


Relative to 0.6V 6T SRAM, 2.2X less leakage power at 0.4V and
3.3X less leakage power at 0.3V
>60X less leakage power than 1.2V


Active Energy Savings with 10T Bitcell

■ 6T memories in 65nm usually at 0.9V or greater (lowest reported
is 0.7V)

■ Operating 10T bitcell at lower voltages saves energy

■ 10T memory can provide high frequency operation at higher
voltages when necessary

Read-Buffer Foot-Driver Limitation

Virtual Cell Supply

Access devices and supply-driver interact
to accurately set VVDD during hold

VVDD settles to low intermediate voltage

[Figure: simulated waveforms of VVDD, Q and QB versus time.]
1) Global variation degrades sense-amp accuracy for
single-ended read

- Pseudo-differential structure eliminates offset

2) Local variation results in uncorrelated error
distribution of sense-amps

• Only device up-sizing can reduce offset deviation

Sense-Amplifier Redundancy

1) Enable only one of N sense-amps for each RDBL

• Similarly applied to flash A-D [Flynn, TCAS'03]

2) Sense-amp offsets are from local variation only (uncorrelated)

Total area is constrained; each sense-amp must be smaller

With redundancy, area of each SA must decrease, and
its offset goes up.

[Figure: sense-amp response versus differential input swing (V).]

Sense-Amplifier Redundancy

[Figure: sense-amp area within the column pitch for N=1, 2, 4 and 8, and error probabilities PERR,1, PERR,2, PERR,4 and PERR,8 versus differential input swing (V).]

Probability of error depends on joint
probability that all sense-amps fail:

$P_{ERR,tot} = (P_{ERR,N})^N$
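The area-versus-offset trade-off behind this joint probability can be sketched numerically. The model below is an illustration under stated assumptions, not the chip's characterization: offsets are taken as independent zero-mean Gaussians, the offset sigma is assumed to scale as 1/sqrt(area) so that splitting a fixed area among N sense-amps multiplies sigma by sqrt(N), and a sense-amp is counted as failing when its offset exceeds the input swing in the wrong direction. The names `p_err_single`, `p_err_total` and `sigma1` are hypothetical.

```python
import math


def p_err_single(swing, sigma1, n):
    """Failure probability of one of n sense-amps sharing a fixed total area.

    sigma1 is the offset sigma when a single sense-amp uses the whole area.
    Assumes sigma scales as 1/sqrt(area), i.e. sigma_n = sigma1 * sqrt(n).
    """
    sigma_n = sigma1 * math.sqrt(n)
    # One-sided Gaussian tail: offset larger than the available swing.
    return 0.5 * math.erfc(swing / (sigma_n * math.sqrt(2.0)))


def p_err_total(swing, sigma1, n):
    """Column fails only if all n independent sense-amps fail."""
    return p_err_single(swing, sigma1, n) ** n


if __name__ == "__main__":
    swing, sigma1 = 0.05, 0.02   # volts, illustrative values only
    for n in (1, 2, 4, 8):
        print(n, p_err_single(swing, sigma1, n), p_err_total(swing, sigma1, n))
```

With these illustrative numbers each individual sense-amp gets worse as N grows, yet the joint failure probability still falls, which is the effect the slide describes.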


Redundancy Implementation

Start-up loop selects
between 2 sense-amps,
yielding error
improvement of 5x
Prototype SRAM

[Figure: die photo, 1.89mm wide.]

Process: 65nm CMOS
Architecture: 8 Blocks X 256 Rows X 128 Columns
Capacity: 256kb
VMIN: 350mV

Measured Leakage Power

Data correctly retained at 300mV,
Pleak = 1.65µW

Leakage-Power & Area Comparison

3x leakage power savings and 30% area
overhead compared with 6T cell

Active Performance

Conclusions

■ Standard 6T cell is ratioed and limited to ~0.6-0.7V,
~16 cells per bitline

■ 10T cell expands read/write margins and manages
bit-line leakage in read-buffer, increasing robustness to
sub-Vt variation

■ 8T cell uses peripheral assists to eliminate sub-Vt
bit-line leakage and weaken local cell feedback during
write

■ Sense-amplifier redundancy improves area-offset
trade-off by factor of 5

■ Sub-Vt bit-cells and peripheral assists allow VMIN to
350mV, improving active energy and leakage power by
up to factor of 20 compared to conventional 6T limit


Sub-threshold Library Design

ESE DARPA Program Final Review
April 10, 2007

Joyce Kwong

Outline

■ Motivation

■ Results

■ Logic design issues in sub-VT

■ Minimum energy operation given yield constraint

■ Sub-VT library test chip

■ Algorithmic error correction

■ Next phase

□ Timing analysis

□ Integrated sub-VT system

Motivation

■ Facilitate design of sub-threshold circuits using
CAD tools

■ Achieve significant energy savings over
above-threshold operation

■ Study and mitigate effects of process variation

■ Goals:

□ Custom cell library for sub-threshold

• 0.25V-1.2V

• 0°C-70°C

• manufacturing process corners
Results

■ Cell library and methodology
validated with test chip

■ Algorithmic error correction
logic demonstrated

■ Sizing approach published at
ISLPED 2006

■ Library to be used in next
generation sub-VT
microcontroller

[Figure: test chip die photo, 16-bit FIR filter, 10k gates.]
Process Variation

VT varies between transistors on
the same die

□ Modeled as Gaussian distribution

[Figure: delay distribution and energy distribution of an 8-bit adder, sub-VT (0.3V) versus above-VT (1.2V); x-axes are delay normalized to sample mean and energy/addition normalized to sample mean.]
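A small Monte Carlo experiment shows why Gaussian VT variation spreads delay far more in sub-VT than above VT. This is a single-gate toy model, not the adder simulation behind the slide: it assumes an exponential current law in sub-threshold, an alpha-power law above threshold, and hypothetical values for the mean VT (0.4 V), its sigma (30 mV), the slope factor `n` and the exponent `alpha`.

```python
import math
import random
import statistics

THERMAL_VOLTAGE = 0.02585  # kT/q near 300 K, in volts


def subvt_delay(vdd, vt, n=1.5):
    """Relative gate delay with sub-threshold (exponential) drive current."""
    i_on = math.exp((vdd - vt) / (n * THERMAL_VOLTAGE))
    return vdd / i_on


def abovevt_delay(vdd, vt, alpha=1.3):
    """Relative gate delay with alpha-power-law drive current (vdd > vt)."""
    i_on = (vdd - vt) ** alpha
    return vdd / i_on


def delay_spread(delay_fn, vdd, vt_mean=0.4, vt_sigma=0.03,
                 samples=5000, seed=1):
    """Coefficient of variation (sigma/mean) of delay under Gaussian VT."""
    rng = random.Random(seed)
    delays = [delay_fn(vdd, rng.gauss(vt_mean, vt_sigma))
              for _ in range(samples)]
    return statistics.pstdev(delays) / statistics.mean(delays)


if __name__ == "__main__":
    print("sub-VT (0.3 V)  :", delay_spread(subvt_delay, 0.3))
    print("above-VT (1.2 V):", delay_spread(abovevt_delay, 1.2))
```

Because delay depends exponentially on VT in sub-threshold, a Gaussian VT yields a roughly log-normal delay with a long tail, whereas at 1.2 V the same VT spread shifts the delay by only a few percent.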
Sub-VT Logic Design Issues

[Figure: gate output voltage levels at the WS corner (V).]

VT variation causes distribution of ‘high’
and ‘low’ voltage levels of a logic gate

What VOL, VOH levels are
acceptable?

Output Swing Metrics

■ Need a consistent way to measure logic failure due to
poor output voltage levels

■ What is considered “poor” output voltage?

Check voltage levels using
butterfly plots
Output Swing Failure Rate

■ Put logic gate under test back-to-back with:

□ NAND3 to check VOH

□ NOR3 to check VOL

■ Define failure = no enclosed square in butterfly plots

■ Failure rate decreases exponentially with device width
and VDD

Constant-Yield Device Sizing

Find failure rate vs. width
plots for different circuit
primitives

Keep device sizes as
small as possible, subject
to yield constraint
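The sizing rule above can be sketched as a small calculation. The model is an assumption for illustration: the failure rate of a circuit primitive at a fixed VDD is taken as `f0 * exp(-k * W)`, following the stated exponential dependence on width, with `f0` and `k` standing for constants that would be fitted to Monte Carlo failure-rate plots. The function name, the width grid, and all numbers are hypothetical.

```python
import math


def min_width_for_yield(f0, k, target, w_min, w_step):
    """Smallest width on the layout grid whose failure rate meets the target.

    Assumed model: failure_rate(W) = f0 * exp(-k * W) at a fixed VDD.
    Widths are w_min, w_min + w_step, w_min + 2*w_step, ...
    """
    if f0 * math.exp(-k * w_min) <= target:
        return w_min  # the minimum-size device already satisfies the constraint
    w_needed = math.log(f0 / target) / k
    # Round up to the next grid point (small epsilon guards float noise).
    steps = math.ceil((w_needed - w_min) / w_step - 1e-12)
    return w_min + steps * w_step


if __name__ == "__main__":
    # Illustrative numbers only: widths in micrometres, k per micrometre.
    w = min_width_for_yield(f0=1.0, k=20.0, target=1e-3,
                            w_min=0.12, w_step=0.01)
    print(f"chosen width: {w:.2f} um")
```

Repeating this per primitive, and per supply voltage, with its own fitted constants gives a set of widths that all meet the same failure-rate target, which is the sense in which the sizing is constant-yield.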
Minimum Energy Operating Point

■ Optimum VDD to minimize
energy/operation

Assumes full functionality
at all VDD

[Figure: energy per operation versus VDD.]

Calhoun & Chandrakasan,
"Characterizing and Modeling
Minimum Energy Operation for
Subthreshold Circuits,"
ISLPED, 2004
Energy-Yield Trade-off

Treat Ceff, Weff as functions of VDD

[Figure: energy versus VDD (V) for constant-yield sizing and minimum sizing.]
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Minimum  Energy  with  Yield 
Constraint 

Compare energy of inverter chain with minimum size and constant-yield sizing

■ Upsize below 0.3V

■ Optimum: minimum size circuit at 0.3V

[Figure: energy vs. VDD (V), 0.25V to 0.45V]


Minimum  Energy  with  Yield 
Constraint 


Example  2:  32-bit  Kogge  Stone  Adder 


Case  1:  if  min.  size  circuit 
reaches  minimum  energy 
before  shaded  area 

□  no  upsizing  necessary 


Case  2:  if  min.  size  circuit 
has  minimum  point  within 
shaded  area 
□  upsize  to  achieve 
minimum  energy  while 
satisfying  yield 
constraint 


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Energy  Variability 

■  Given  the  same  yield  constraint: 

□  Upsized  adder  has  smaller  mean  leakage  current 
and  total  energy 

□  Min.  size  adder  has  smaller  energy  spread 




Sub-VT  Library  Test  Chip 


■ Synthesized from sub-VT library using commercial CAD tools

■ Demonstrates 16-bit FIR filter with error correction feature

■ Silicon verified to work at <300mV

[Figure: sub-VT library test chip]


Sub-VT  Library  Test  Chip 


■  Measured  results 
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Error  Correction 


■  Encode  system  states  into  redundant  state 

■  Detect  hard  and  soft  errors 


■  Correct  soft  errors  by  adjusting  clock  frequency 


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Error  Correction  Demonstration 

[Figure: error correction demonstration; annotation: clock frequency decreased]
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Outline 



■  Next  phase 

□  Timing  analysis 

□  Integrated  sub-VT  system 
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Timing  Analysis 

■ Traditional STA uses deterministic best/worst case delay
■ Leads to over-design when delay spread is large
■ Consider statistical averaging effects
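
The averaging effect mentioned above can be seen in a small Monte Carlo sketch. It assumes independent, lognormally distributed gate delays (the distribution and `sigma_log` are illustrative assumptions, not measured data) and shows that the relative spread of a path delay shrinks as logic depth grows.

```python
import random
import statistics

def path_delay_spread(n_gates, sigma_log=0.5, trials=5000, seed=0):
    """Return sigma/mu of the total delay of a path of n_gates independent gates."""
    rng = random.Random(seed)
    delays = [
        sum(rng.lognormvariate(0.0, sigma_log) for _ in range(n_gates))
        for _ in range(trials)
    ]
    return statistics.pstdev(delays) / statistics.fmean(delays)

if __name__ == "__main__":
    for depth in (1, 4, 16, 64):
        print(depth, round(path_delay_spread(depth), 3))
```

For independent gates the spread falls roughly as one over the square root of the depth, so stacking each gate's worst-case delay overestimates the delay of a long path.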
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Register  Hold  Time 


■ Register hold time variability (σ/µ)

□ depends on clock/data slew rates

□ differs between registers

■ No easy way to predict hold time of a given register!

[Figure: hold time histograms for a D-latch and a D-register with reset; x-axis: hold time]
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Possible  Approaches 

Model  as  canonical  distributions 

□  makes  computation  easier 

□  accurate  modeling  of 
distribution  tail  is  critical 


Efficient  simulation  approach 
□  simulate  paths  most  likely  to 
fail 

□ coarsely discretize delay
distribution
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Integrated  Sub-VT  System 


■  Integrated  system  operating  from  external 
battery 

■ Sub-VT SRAM critical to reducing leakage

■ Microcontroller supports low power modes


Sub-VT  Microcontroller  (MSP430) 


■ TI MSP430

□ 16-bit RISC

□ 16 registers, 7 addressing modes, 27 instructions

□ applications: utility metering, security, portable


Conclusions 


■  Sub-VT  offers  drastic  energy  savings 

■  Sensitivity  to  variation  must  be  mitigated 

□  device  sizing 

□  architecture 

■  Minimum  energy  point  changes  due  to  yield 
constraint 

□  upsizing  in  deep-sub-VT  is  still  advantageous 

■  Accurate  timing  analysis  is  critical  to  building 
complex  sub-VT  systems 
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Minimum  Energy  Tracking  Loop 
with  Embedded  DC-DC  Converter 
Delivering  Voltages  Down  to  250mV 
in  65nm  CMOS 


Yogesh  K.  Ramadass  and  Anantha  P.  Chandrakasan 
Massachusetts  Institute  of  Technology 



Micro-Power  Applications 

[Figure panels: Wireless Sensor Networks; Medical Devices; Ambient Intelligence, RFID]

■ Emerging energy-constrained applications

■ Increase battery life-time through system-level
energy management techniques

■ Energy scavenging possible - system power
<10µW


Outline 
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■  Motivation  and  System  Architecture 

■  Energy  Sensing  Technique 

■  Low  Power  DC-DC  Converter 

■  Measurement  Results 

■  Conclusions 
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Sub-threshold  Operation 

[Figure: energy per operation vs. VDD (normalized). Strong inversion operation: fast, high energy. Sub-threshold operation: slower, minimum energy.]


Goal:  Minimize  Energy  per  Operation 
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Minimum  Energy  Point  (MEP) 


[Figure: energy per operation vs. VDD (V)]


Motivation - Minimum Energy Tracking

[Figure: energy per operation vs. VDD (V)]


■  Minimum  Energy  Point  (MEP)  varies  with  workload  and 
temperature 

■  MEP  moves  when  ratio  of  active  to  leakage  energy 
changes 

■ Tracking the MEP: 0.5X - 1.5X energy savings
Calculating V1 - V2


1. Sample V1 across C1

[Figure labels: storage capacitor (Cload), C1, C2, V1, V2]

AFTER N OPERATIONS

2. Sample V2 across C2

3. Drain C1 to V2 and count the number of clock cycles, K

K ∝ (V1 - V2)
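
A minimal sketch of how the sampled voltages relate to energy per operation, assuming the load runs N operations from the storage capacitor alone while its voltage droops from V1 to V2. The exact expression is the energy drawn from a capacitor; the approximation shows why a count proportional to (V1 - V2) is enough for small droops. The function names and the numeric values are illustrative assumptions, not taken from the chip.

```python
def energy_per_operation(c_load, v1, v2, n_ops):
    """Exact energy per operation drawn from a capacitor drooping from v1 to v2."""
    return 0.5 * c_load * (v1 ** 2 - v2 ** 2) / n_ops

def energy_per_operation_approx(c_load, v1, v2, n_ops):
    """Small-droop approximation: V1^2 - V2^2 is about 2 * V1 * (V1 - V2)."""
    return c_load * v1 * (v1 - v2) / n_ops

if __name__ == "__main__":
    # Hypothetical values: 100 nF storage capacitor, 10 mV droop over 32 operations.
    exact = energy_per_operation(100e-9, 0.35, 0.34, 32)
    approx = energy_per_operation_approx(100e-9, 0.35, 0.34, 32)
    print(exact, approx)
```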


Minimum Energy Tracking Algorithm


■ Uses a slope tracking algorithm

■ Starting Vref and initial direction can be set by the
user


Minimum Energy Tracking Algorithm


■  If  new  Eop  is  smaller,  continue  incrementing  Vref 

■  Else,  change  direction  and  decrement  Vref  until 
minimum  is  achieved 


Minimum Energy Tracking Algorithm


■  The  new  Eop  is  smaller,  hence  the  loop 
continues  to  decrement  Vref 

■ One more computation is required before the
loop settles to the minimum energy


Minimum Energy Tracking Algorithm


1.  Calculate  Eop  again 

2.  Compare 


The  new  Eop  is  higher,  so  the  loop  reverses 
direction  and  increments  Vref  one  last  time 
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Minimum  Energy  Tracking  Algorithm 


Vref  is  set  at  the  minimum  energy  point 

The  loop  shuts  down 

The  load  circuit  continues  to  operate 
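
The slope tracking steps walked through on these slides can be sketched as follows. This is a simplified software model, not the on-chip implementation: `measure_eop` is a hypothetical callback standing in for the energy sensor, and Vref is represented as an integer DAC code.

```python
def track_minimum_energy(measure_eop, start_code, n_codes, direction=+1):
    """Step Vref while energy per operation falls; stop after the second reversal."""
    code = start_code
    best = measure_eop(code)
    reversals = 0
    while reversals < 2:
        nxt = code + direction
        if 0 <= nxt < n_codes:
            e_next = measure_eop(nxt)
            if e_next < best:
                # New Eop is smaller: keep moving in the same direction.
                code, best = nxt, e_next
                continue
        # New Eop is higher (or the DAC range ended): reverse direction.
        direction = -direction
        reversals += 1
    return code

if __name__ == "__main__":
    eop = lambda c: (c - 8) ** 2 + 1  # hypothetical convex Eop curve, minimum at code 8
    print(track_minimum_energy(eop, start_code=20, n_codes=37))
```

Once the second reversal occurs the loop holds the best code found, mirroring the slide where Vref is set at the minimum energy point and the loop shuts down.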



DC-DC  Converter  Architecture 

[Figure: converter block diagram. Labels: Vref (from DAC), VBAT (1.2 V), Fixed Pulse Width Generator, Divider and Level Converter, Variable Pulse Width Generator, L, load, Off-chip]

VDD: 250mV - 700mV; Load Power: 1µW - 100µW

Converter  operates  in  Pulse  Frequency  Modulation 
(PFM)  mode 

Vref  is  set  digitally  by  the  loop 
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Techniques  to  Improve  Efficiency 


Conduction, Switching Loss

□ Optimal Power Transistor Sizing

Control Loss (affects low load efficiency)

□ Simple PFM Control

□ All-digital control to achieve approximate ZCS

□ Comparator clock scales with load power


□ Finite delay between PMOS and NMOS pulses

□ Increase Eload, minimize contribution of ECpar
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Test  Load  -  FIR  Filter 

[Figure: FIR filter block diagram]

■ 7-tap FIR filter capable of operation down to 250mV
■ Workload varied by changing the number of taps
■ Leakage remains constant as the number of taps is changed
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Test  Chip  Die  Photo 


■  65nm,6LM  CMOS 

■  Die  area  - 
1.05mm  x  1.12mm 

■  Circuit  active  area 
-  0.23  mm2 

■  Minimum  energy 
tracking  circuitry 
occupies  just 

0.05mm2 
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DC-DC  Converter  Efficiency 


■ >80% efficiency while delivering 1µW load power


Measured Energy Savings

■ MEP increases on decreasing workload -
1.1X energy savings

■ MEP increases with increase in temperature -
0.5X energy savings
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MEP  Tracking  Loop  Properties 


■ Tracking Loop

□ Non-invasive tracking

□ Energy computed of the actual circuit - no replicas

■ Tracking Methodology

□ Independent of the size of the load circuit

□ Independent of the DC-DC converter topology

■ Overhead

□ Energy overhead - 50 operations

□ Area overhead = 0.05mm²

■ Multiple loops - distinct voltage domains
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IMiT 

Switched  Cap  Architecture 

[Figure: switched capacitor converter block diagram. Labels: DAC, COMP, Non-Overlapping Clock Generator, Clk4X, Automatic Frequency Scaler, Switch Matrix]

■ ON-OFF mode control. Vref is set digitally

■ No static power loss

■ Completely on-chip except for Cload
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Die  Photo 


■ Die area - 1.6mm
x 1.6mm

■ Circuit active
area - 0.57 mm²

■ Gate-oxide
capacitors are
used
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I'll"  Conclusions 


■  Control  loop  tracks  minimum  energy  voltage  of 
arbitrary  digital  circuits 

■  On-chip  energy  sensor  circuitry  has  very  low 
energy  and  area  overhead 

■ Low power DC-DC converter achieves >80%
efficiency at 1µW load power

■  A  preliminary  version  of  a  switched  capacitor  DC- 
DC  converter  has  been  implemented 


Acknowledgements: Funding provided by DARPA.

Chip fabrication provided by Texas Instruments
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A  Parallel  Energy  Efficient  100  Mbps  Ultra- 
Wideband  Radio  Baseband 


Brian P. Ginsburg, Vivienne Sze, and Anantha
Chandrakasan

Massachusetts Institute of Technology

Table  of  Contents 

■  System  Motivation 

■  Time-Interleaved  ADC 

■  Digital  baseband  processor 

■  Mixed-signal  optimum  energy  point 

■  Conclusion 
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Ultra-Wideband  Radio 


FCC  Specifications 

■ > 500MHz 10dB bandwidth

■ Very low average power density

[Figure: FCC Spectrum Mask and 14-Channel Frequency Plan, 3.1 to 10.6 GHz; x-axis: Frequency (GHz)]

High Data Rate

■ MB-OFDM extends the well-known 802.11a/g technique to UWB
bandwidths for up to 480Mb/s

■ DS-UWB is pulse-based communication at up to 2Gb/s

■ Short distances (<10m)

UWB  Receiver  Design 

[Figure: receiver block diagram; BPSK-modulated Gaussian pulses enter through the LNA]

■ Pulses separated by 10ns during payload: 100Mb/s peak data rate

■ Performs timing acquisition, channel estimation, and data
demodulation
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ADC  Parallelism:  Voltage  or  Time 


[Figure: two forms of ADC parallelism: Flash ADC (complexity vs. resolution) and time-interleaving]

■ SAR channel: linear growth in complexity, but long latency

■ Time interleaving suffers from additional sources of distortion:

□ Timing skew

□ Gain mismatch

□ Offset mismatch
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ADC  Measurements 


■ TI 65nm CMOS process
■ INL < 0.2, DNL < 0.25 LSB
■ ENOB = 4.5 (DC), 4 (Nyquist)
■ Complete results in [Ginsburg, VLSI 2006]

[Figure: FFT of near-Nyquist input; x-axis: Frequency (MHz), 0 to 250]

[Figure: energy per conversion step vs. sampling frequency (MHz), 125 to 500; annotations: 720 fJ/conv. step, 440 fJ/conv. step]
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UWB  Packet  Structure 


[Figure: packet timeline: PREAMBLE | PAYLOAD; labels: Receiver Turns ON, Receiver Turns OFF]

■ Goal: Reduce energy spent during acquisition (overhead)

■ Majority of acquisition energy spent on computation of
cross-correlation
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Aggressive  Voltage  Scaling 


Correlator  Architecture 


PN  Sequence 


■  Correlators  compute  the  cross-correlation  function 

■  Voltage  scaling  to  reduce  energy  per  operation 

■  Parallelize  to  maintain  throughput  of  500  MS/s 

■  Designed  and  simulated  in  a  90-nm  process 


Operate Near Minimum Energy Point

■ At the minimum energy point of 0.3 V:
9X energy reduction

■ Set clock frequency to 25 MHz
(preamble PRF)

■ Parallelize by 20 to maintain 500 MS/s
throughput

■ Need to raise voltage to 0.4 V to
achieve 25 MHz

■ At 0.4 V, reduce energy per operation
by almost 6X
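
The parallelism factor quoted above follows directly from the throughput and clock figures on the slide. A minimal sketch of that arithmetic (the function name is illustrative):

```python
import math

def required_parallelism(throughput_sps, clock_hz):
    """Number of parallel units needed to sustain a sample rate at a given clock."""
    return math.ceil(throughput_sps / clock_hz)

if __name__ == "__main__":
    # 500 MS/s input stream processed by units clocked at 25 MHz
    print(required_parallelism(500e6, 25e6))
```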
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Energy-Area  Tradeoff 


[Figure: Energy-Area Tradeoff for Digital Baseband Processor]

Baseband  Architecture 
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400mV  Baseband  Processor 


■  ST  90-nm  CMOS  process 

■  281k  gates 

■ Includes 620 correlators & 4 matched filters

■ Die area: 10.94mm² (Active area 23%)

[Oscilloscope signals: Data Ready, Output Data [1-0], Output Clock]

Oscilloscope plot shows correct functionality at 400mV @ 25 MHz


Energy Per Bit

[Figure: energy per bit vs. size of packet (bits): 512b, 1kb, 2kb, 4kb, 8kb]
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ji|j-  Combined  Mixed-Signal  Optimum  Energy?  ^ 


Increase the number of parallel ADCs, with each one driving a
correlator bank: slower ADCs

■ Digital power in ADC directly benefits from voltage scaling
and increased parallelism

■ Rebias analog circuits for improved gm/ID

■ Sampling/clock distribution limited

SAR  Energy  Model 
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Optimum  Energy  Point 


93  channels 


Capacitor Array Sizing

[Figure: capacitor array size vs. increasing parallelism]

2.5x capacitor size penalty/channel for 36x parallelism


Linearity Calibration

■ Apply standard SAR calibration technique [Lee JSSC 12/84]

■ Fixes INL/DNL/offset but with additional per-channel complexity

Q: What about preamplifier gain/bandwidth? Digital
propagation delays? Residual timing skew? Increasing
variation in deep-submicron CMOS?

A: Redundancy


Apply  redundancy  at  maximum  level  of  parallelism: 
the  channel 
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Power  of  Redundancy 

Behavioral  Simulations 


6  redundant  channels  (17%  overhead)  - 
2x  capacitor  size  reduction 
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36  (+6)  Channel  ADC  Testchip 


■ Balanced trees for input and sampling network signals

■ DFT

□ Separate debug bus for individual channel measurements

□ Programmed with 714 bit configuration register

[Die photo: 2.65mm; Texas Instruments 65nm CMOS]

Basic  Performance 

■ Individual channels operational at 0.8V to 600MS/s overall
sampling rate

■ Interleaved results limited by timing of output mux at over
400 MS/s

[Figure: FFT of 190MHz input at 400MS/s; x-axis: frequency (MHz)]

Power at 250 MS/s

□ 1.24mA at 0.8V

□ 0.4mA at 1.2V (sampling)

□ 1.46mW total (280fJ/conv. step)
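
The power and figure-of-merit numbers on this slide can be cross-checked with a short sketch. It assumes the common figure of merit P / (2^ENOB · fs); the slide does not state the ENOB at this operating point, so the code only computes the ENOB implied by the quoted numbers.

```python
import math

def total_power_w(rails):
    """Sum current x voltage over a list of (current_A, voltage_V) supply rails."""
    return sum(current * voltage for current, voltage in rails)

def conversion_energy_j(power_w, enob, fs_hz):
    """Energy per conversion step: P / (2^ENOB * fs)."""
    return power_w / (2 ** enob * fs_hz)

def implied_enob(power_w, fom_j, fs_hz):
    """ENOB implied by a quoted power, figure of merit, and sampling rate."""
    return math.log2(power_w / (fom_j * fs_hz))

if __name__ == "__main__":
    p = total_power_w([(1.24e-3, 0.8), (0.4e-3, 1.2)])
    print("sum of rails (W):", p)  # close to the 1.46mW quoted on the slide
    print("implied ENOB:", implied_enob(1.46e-3, 280e-15, 250e6))
```

The rail currents sum to about 1.47mW, consistent with the quoted 1.46mW after rounding.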
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Measured  Channel  Variation 


Per-channel measurements

INL = 0.50 ± 0.31 LSB

OSpp = 2.8 ± 2.0 LSB


Redundancy SNDR Improvement

Conclusions 

Demonstrated  energy  efficient  high-performance  UWB 
baseband 

Time-interleaving  permits  slower  but  more  efficient  ADC 
architectures 

Digital  energy  minimized  at  ultra-low-voltages;  parallelism 
with  near-zero  overhead  takes  full  advantage  of  reduced 
supplies. 

Parallelism  yields  significant  energy  savings  in  mixed- 
signal  circuits,  particularly  for  those  with  significant  analog 
and  digital  complexity. 

Redundancy  is  an  incredibly  powerful  tool  to  achieve  the 
full  benefit  of  advanced  technologies. 


Acknowledgments: DARPA, NDSEG Fellowship, and NSERC
Fellowship for funding
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Ultra-Wideband  Signaling 
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Narrowband  Signal 


[Figure: spectrum vs. frequency for a narrowband signal and an impulse-UWB signal]

■ FCC defines UWB as bandwidth >500MHz

■ UWB signals are narrow in time

■ Energy spread over wide bandwidth
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Specifications 
All-digital transmitter
Energy  detection  receiver 
System  integration 
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■  Low-data  rate,  energy-constrained  applications 


[Figure: energy/bit vs. data rate [b/s], 1k to 1G, for published radios [ISSCC]. Trend: as data rate increases, energy/bit decreases.]

■ Pulsed-UWB signaling inherently duty-cycled

□ TX and RX on only when a pulse is present

□ Fast (2ns) turn-on time
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System  Specifications 
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PPM  signaling  with  non-coherent  receiver 
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Pulse  Generation  Principle 
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*  Use  a  tapped  variable  delay  line  and  edge 
combiner  to  synthesize  a  pulse 
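
A rough sketch of the relationship between tap delay, center frequency, and pulse width. It assumes that each combined edge toggles the output, so one RF cycle spans two tap delays; that relation, the function names, and the 125ps and 30-edge values are illustrative assumptions rather than figures from the slides.

```python
def center_frequency_hz(tap_delay_s):
    """Assumed relation: two tap delays make one output cycle."""
    return 1.0 / (2.0 * tap_delay_s)

def pulse_width_s(n_edges, tap_delay_s):
    """Pulse duration grows with the number of edges combined."""
    return n_edges * tap_delay_s

if __name__ == "__main__":
    tau = 125e-12  # hypothetical tap delay
    print(center_frequency_hz(tau))  # about 4 GHz
    print(pulse_width_s(30, tau))    # about 3.75 ns
```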


■ Center frequency depends on delay

■ Width depends on number of edges combined

Frequency selectivity without LO

Delay Range and Accuracy

Measured RF Output

Low frequency calibration algorithm

Delay-Based BPSK Scrambling

3-Channel Spectrum

0.5V-0.65V LNA

Adjustable channel select filtering

Passive Self-Mixer

Baseband Demodulator

■ Uses parallelism to increase throughput

■ Switched-capacitor circuits

■ All circuits operate at 500mV
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Measurement  Results 
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Performance  Summary 
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Transmitter  Receiver 

Die area: 0.2x0.4mm² (transmitter), 1.0x2.2mm² (receiver)

VDD: 1.0V (transmitter), 0.5-0.65V (receiver)

Leakage: 96µW (transmitter), 3.5µW (receiver)

Power: 0.72mW (transmitter), 41.8mW (receiver)

Energy/bit at 16.7Mb/s: 43pJ/bit (transmitter), 2.5nJ/bit (receiver)

90nm CMOS

[Die photos: receiver [F. Lee, ISSCC2007]; transmitter [D. Wentzloff, ISSCC2007]]

■ Achieves 3 channels in 3.1-5GHz UWB band

■ Architectures use digital techniques to reduce
power (interleaving, parallelism, stacking)

■ At low data rates, power dominated by leakage
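
The energy/bit entries in the summary are consistent with the power entries divided by the 16.7Mb/s data rate, which a short check makes explicit (the function name is illustrative):

```python
def energy_per_bit_j(power_w, data_rate_bps):
    """Energy per bit is average power divided by data rate."""
    return power_w / data_rate_bps

if __name__ == "__main__":
    rate = 16.7e6
    print("TX:", energy_per_bit_j(0.72e-3, rate))  # about 43 pJ/bit
    print("RX:", energy_per_bit_j(41.8e-3, rate))  # about 2.5 nJ/bit
```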
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System  Integration 
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Receiver 


□ All synchronization performed in FPGA

□ UWB antenna

Transmitter

□ Powered from USB bus

□ Pulse spectrum digitally calibrated

Achieved a 16.7Mb/s wireless link
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Conclusions 

■ All-digital transmitter

□ Benefits from scaling, low-power digital techniques

□ No analog biases required

■ Energy-detection receiver

□ Architecture leverages digital techniques

□ 2ns startup time for deep duty-cycling

■ UWB radios can exploit available bandwidth
■ Low-voltage circuits can further reduce power

Constant 2.5nJ/bit from 10kb/s to 16.7Mb/s
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34.4  A  256kb  Sub-threshold  SRAM  in  65nm  CMOS 

Benton  H.  Calhoun,  Anantha  Chandrakasan 

Massachusetts  Institute  of  Technology,  Cambridge,  MA 

Low-voltage sub-threshold operation has proven to minimize energy
per operation for logic [1], and sub-threshold systems will require
memories that function at the same low voltages. This paper describes
a 65nm SRAM that functions into the sub-threshold region and examines
the impact of process variation on low-voltage operation.

Previous efforts to reduce SRAM power have included voltage scaling
to the edge of sub-threshold [2] or into the sub-threshold region [3], but
only for idle cells. Although some published SRAMs operate at the
edge of sub-threshold, none function at sub-threshold supply voltages
compatible with logic operating at the minimum energy point. The
0.18µm memory in [4] provides one exception. Consisting of latches
and using MUX-based read (18T-equivalent bitcell), it operates to
180mV.

Traditional 6T SRAMs face many challenges in deep submicron (DSM)
technologies for low VDD operation. Predictions in [5] suggest that
process variations will limit standard 90nm SRAMs to around 0.7V
operation because of static noise margin (SNM) degradation and write
margin, and a VDD of 0.7V is reported for a 65nm SRAM [6].
Measurement results confirm that SNM degradation and inability to
write are the two most significant obstacles to sub-threshold SRAM
functionality in 65nm. This paper discusses each of these problems,
along with a bitcell and an architecture that overcome them.

Figure 34.4.1 shows the impact of local VT mismatch on the SNM
for a standard 6T bitcell in a 65nm process. The Monte-Carlo
simulations show that larger channel area decreases the spread of
SNM ($\sigma_{V_T} \propto 1/\sqrt{WL}$) and that global variation shifts the
distribution caused by mismatch [9]. The Hold SNM at 0.3V has roughly the
same mean as the Read SNM at 0.5V. However, the 6σ Hold SNM at
0.3V roughly equals the 6σ Read SNM at 0.6V. Likewise, the 6σ Hold
SNM at 0.4V and 6σ Read SNM at 0.8V are equivalent. Thus, by
eliminating the degraded Read SNM, a bitcell can be operated at 0.3V with
the same 6σ stability as a 6T bitcell at 0.6V. A 7T cell avoids Read
SNM for above-VT SRAM [7], but the dynamic storage that it uses is
problematic for the longer cycle times of sub-VT operation.
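
The area dependence noted above, where VT spread scales inversely with the square root of channel area, can be sketched as follows. The matching coefficient `a_vt` and the device dimensions are hypothetical values used only to show the scaling; they are not parameters of the 65nm process discussed here.

```python
import math

def sigma_vt(a_vt, width, length):
    """Local VT mismatch spread: sigma_VT = A_VT / sqrt(W * L)."""
    return a_vt / math.sqrt(width * length)

if __name__ == "__main__":
    a_vt = 3.0e-3  # V*um, hypothetical matching coefficient
    small = sigma_vt(a_vt, 0.2, 0.06)  # dimensions in um (assumed)
    large = sigma_vt(a_vt, 0.8, 0.06)  # 4x the channel area
    print(small, large, small / large)
```

Quadrupling the channel area halves the spread, which is why upsizing tightens the SNM distribution.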

The 10T bitcell in Fig. 34.4.2 uses transistors M7 to M10 to remove the
problem of Read SNM by buffering the stored data during a read
access. Thus, the worst-case SNM for this bitcell is the Hold SNM
related to M1 to M6, which is the same as the 6T Hold SNM for
same-sized M1 to M6. Results from [8] show that single-ended read offers
competitive speed for the same area efficiency in DSM. This 10T
bitcell uses a full-swing single-ended read that can be "sensed" using an
inverter. Clearly, the extra FETs increase the area by ~66% and also
consume leakage power. M10 significantly reduces leakage power
relative to the case where it is excluded. In unaccessed cells, M10
prevents node QBB from pulling to '0' even when QB='1'. In this
technology, the PMOS sub-threshold current is stronger than NMOS, so node
QBB floats close to VDD and decreases sub-threshold current through
M8. Also, when QB='0', leakage through M7 is reduced by the stack
that M10 creates. Specifically, for iso-VDD, the 10T cell without M10 (a
9T cell) has 50% higher leakage current than the 6T, but adding M10
drops the overhead to 16%. This overhead in leakage current is more
than compensated by decreasing VDD by 300mV relative to the 6T
bitcell. In simulation, the 10T bitcell at 300mV consumes 2.25× less
leakage power than the 6T bitcell at 0.6V (1.75× less relative to 0.5V).

The reduction in sub-threshold leakage through M8 reduces the
impact of leakage from unaccessed cells and gives the additional
advantage of allowing more cells on a BL during read. Figure 34.4.3
shows the impact of BL leakage on the steady-state voltages while
reading a '1' (solid lines) or '0' (dotted lines). For the same number of
cells on a BL, the 10T bitcell shows larger BL separation than the 6T
(or 9T) bitcells, and "sensing" with an inverter (whose switching
threshold, VM, is shown) works in simulation from 0°C to 100°C at all corners
for 256 cells on a BL. For the 6T cell (or 9T), BL leakage limits the
number of cells on a BL to 16 at several process corners for 0.3V. The
higher level of integration allowed by the 10T cell reduces the
peripheral circuits and slightly mitigates the bitcell area overhead. In order
to combat the impact of local VT mismatch, the WL voltage is boosted
relative to the array VDD by 100mV.

Write functionality is the second major obstacle to sub-threshold
SRAM, as in this 65nm technology, a 6T bitcell cannot write in the
traditional fashion below 0.6V. The plot in Fig. 34.4.4 shows the write
margin for the 6T cell under typical and worst-case process corner and
temperature. In both cases, the write fails, as is evident from continued
bistability in the cell. Sizing alone cannot correct this problem,
because the exponential dependence of sub-threshold drive current on
VT overwhelms the impact of sizing. To achieve write in sub-threshold,
the virtual supply (VVDD) to the selected cells floats during the write
operation (e.g. [5]). The plot shows that, even for the worst case, this
method provides ample negative noise margin for ensuring a write.
Clearly, the side of the bitcell holding a '1' is degraded in voltage due
to the collapsing virtual supply. Figure 34.4.4 also shows the essential
timing required for the write operation to bring this value to full VDD.
The VVDD floats as VDDon is asserted along with WL_WR. The crucial
transition in the diagram occurs when VDDon goes low before
WL_WR, allowing positive feedback to restore the '1' to full VDD. In the
test chip, each row contains a single 128b word that is written at the
same time and shares the same VVDD. The block diagram in Fig. 34.4.4
shows how the row is "folded" so that its cells share a VVDD line.

A 256kb 65nm test chip (Fig. 34.4.7) uses the 10T bitcell and the
architecture shown in Fig. 34.4.5. The decoders and other periphery use
static CMOS logic for robust sub-threshold operation. The entire array
functions at one VDD, and the WL and write drivers operate at 100mV
above that supply.

Assuming one redundant row and column are allocated per block, this
implementation of the SRAM functions to below 400mV. At 400mV, it
consumes 3.28µW and works up to 475kHz. No bit errors for holding
data occur in the SRAM until VDD scales below 250mV. Reading works
without error at 320mV and writing at 380mV at 27°C. At 85°C, the
SRAM writes without error at 350mV and reads without error at
360mV. The measurements on the chip are performed down to 300mV
(Fig. 34.4.6 shows correct operation); however, at this low voltage
mismatch results in bit errors in ~1% of the bits. One type of bit error
occurs when a bit holding a '1' is read as a '0' (non-destructive read).
This occurs along columns whose sensing inverter has a high VM due to mismatch.
For rows whose MP is stronger due to mismatch, the write operation
fails to overpower MP sufficiently to flip the contents of the cell, even
when VVDD is floating. Both of these problems can be fixed by minor
changes to the peripheral circuits, allowing further VDD reduction.
Leakage power reduction from VDD scaling is 2.4× and 3.8× relative to
0.6V operation at 0.4V and 0.3V, respectively (Fig. 34.4.6), and active
energy savings are 2.25× and 4×.
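
The active energy savings quoted above match what quadratic supply scaling predicts, assuming the switched capacitance is unchanged between operating points. A short check of that assumption:

```python
def active_energy_ratio(v_ref, v_new):
    """Active (CV^2) energy at v_ref divided by active energy at v_new."""
    return (v_ref / v_new) ** 2

if __name__ == "__main__":
    print(active_energy_ratio(0.6, 0.4))  # about 2.25
    print(active_energy_ratio(0.6, 0.3))  # about 4
```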
Figure 34.4.1: Impact of local mismatch on 6T SNM in 65nm. Read SNM has larger
standard deviation. Hold SNM at 0.3V has roughly the same mean as Read SNM at
0.5V and same 6σ SNM as Read SNM at 0.6V.


Figure 34.4.2: 10T bitcell for sub-threshold operation. Removing Read SNM allows
operation at 0.3V, which leads to 2.25× reduction in leakage power.


Figure 34.4.3: BL leakage limits the number of cells on a BL. The 10T bitcell can
sustain 256 cells/BL at 0.3V compared to 16 without M10 (6T or 9T).


Figure 34.4.5: Architecture of the 256kb test chip.


Figure 34.4.6: Chip functioned correctly to below 400mV. Scope plot shows 300mV
operation; at this low voltage, some bit errors were observed.
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ABSTRACT 

This paper describes how parallelism in the digital baseband
processor can reduce the energy required to receive
ultra-wideband (UWB) packets. The supply voltage of the digital
baseband is lowered so that the correlator operates near its
minimum energy point, resulting in a 68% energy reduction
across the entire baseband. This optimum supply voltage
occurs below the threshold voltage, placing the circuit in the
sub-threshold region. The correlator and the rest of the
baseband must be parallelized to maintain throughput at this
reduced voltage. While sub-threshold operation is
traditionally used for low energy, low frequency applications
such as wrist-watches, this paper examines how
sub-threshold operation can be applied to low energy, high
performance applications. The correlators are further
parallelized for a 31x reduction in the synchronization time,
which along with duty-cycling, lowers the energy per packet
by 43% for a ^00 byte packet. Simulation results for a
100Mbps UWB baseband processor are described.


1.  INTRODUCTION 

The FCC has authorized UWB wireless communications in
the 3.1GHz to 10.6GHz band with a minimum bandwidth of
500MHz and a maximum equivalent isotropic radiated
power spectral density of -41.3dBm/MHz [1]. IEEE
working group 802.15.3a is developing a high data rate
standard for wireless personal area networks using UWB.

Applications  of  UWB  include  battery-operated  devices 
such  as  mobile  phones,  handheld  devices  and  sensor  nodes. 
Consequently,  there  is  a  strong  demand  for  an  energy 
efficient  UWB  system.  This  paper  will  describe  how 
operating  the  digital  baseband  in  the  sub-threshold  region 
and  increasing  the  degree  of  parallelism  can  translate  into 
energy  savings  across  the  entire  UWB  receiver. 

2.  UWB  SYSTEM  ARCHITECTURE 

The UWB packets are built from a sequence of binary
phase-shift keying pulses with a 500MHz bandwidth. The
transmitter generates approximate Gaussian pulses and
upconverts the packet to one of 14 channels in the 3.1GHz
to 10.6GHz band. Each packet, shown in Figure 1, is
divided into two sections: preamble and payload. The
preamble contains multiple repetitions of a Nc=31 bit Gold
code sent at a pulse repetition frequency (PRF) of 25MHz,
or Tpre=40ns. The payload contains the actual data and is
sent at a PRF of 100MHz, or Tpay=10ns, for a 100Mbps data
rate with no channel coding.


Fig. 1. UWB Packet Format

The  receiver,  shown  in  Figure  2,  uses  a  direct  conversion 
architecture  in  the  front-end  and  the  in-phase  and  quadrature 
components are sampled at 500MSPS by two 5-bit ADCs.
For real-time demodulation of the UWB packet, the digital
baseband must perform the signal processing with a
throughput of 500MSPS. Synchronization is performed
entirely in the digital domain. Only the automatic gain
control (AGC) is fed back to the analog domain so that the
digital  baseband  can  scale  to  lower  geometries.  The 
baseband  was  simulated  using  the  digital  logic  cell  library  of 
a  90-nm  process. 


Fig. 2. UWB Receiver

3. DIGITAL BASEBAND PROCESSOR

The  digital  baseband  performs  packet  detection,  acquisition, 
delay  correction  and  channel  estimation  using  the  preamble, 
followed  by  demodulation  of  the  payload.  Additional 
repetitions  in  the  preamble  are  required  for  AGC,  but  will 
not be included in the discussion. Figure 1 outlines the
baseband  processor's  four  states  of  operation  with  respect  to 
the  packet. 

In  State  0,  the  acquisition  phase,  the  baseband  detects  the 
presence  of  a  packet  and  provides  an  initial  estimate  of  its 
delay. This is accomplished by performing a correlation of
the input with an unknown delay against a 31-bit Gold code.
Each correlation takes place over Tcode=Nc×Tpre=1240ns.
The delay must be resolved up to 2ns accuracy; therefore,
there are a total of 620 possible delays and corresponding
correlations: 20 to match the pulse position, and 31 to match
the Gold code. Until acquisition is achieved, the baseband
remains in State 0 and performs these correlations. When a
correlation exceeds a predefined threshold, acquisition is
declared (i.e. lock is detected) and the baseband retimes the
input so that it is aligned before moving on to State 1. If all
620  delays  are  checked  and  the  baseband  does  not  detect 
lock,  the  UWB  receiver  turns  off. 
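
The size of the acquisition search space follows directly from the packet timing. As a quick arithmetic check (a sketch only; the variable names are illustrative, and the values are the ones quoted in the text):

```python
# Acquisition search space, using the values quoted in Sections 2 and 3.
NC = 31             # Gold code length in bits
T_PRE_NS = 40       # preamble pulse period in ns (25MHz PRF)
RESOLUTION_NS = 2   # required delay resolution in ns

t_code_ns = NC * T_PRE_NS                     # duration of one correlation
pulse_positions = T_PRE_NS // RESOLUTION_NS   # delay hypotheses within one pulse period
total_delays = pulse_positions * NC           # pulse positions x code shifts

print(t_code_ns, pulse_positions, total_delays)
```

The 20 pulse positions are the hypotheses within one pulse period, and the 31 code shifts are the possible alignments of the Gold code.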

In State 1, the channel estimation phase, the baseband
must acquire channel estimates from the output of the
correlators. This must be done before demodulation in order
to compensate for the detrimental effects in the UWB
channel [2]. The channel estimates are used to construct a
five  tap  FIR  matched  filter  that  takes  both  the  pulse  shape 
and  channel  impulse  response  into  account. 

In State 2, the detection of payload phase, the baseband
waits for the end of the preamble, which is indicated by an
inverted replication of the Gold code. During States 1 and 2,
the baseband continuously performs correlations to check
that the baseband remains locked. If a threshold is not met,
the packet is assumed to be lost or to have been a false
packet lock, and the baseband and the rest of the UWB receiver
turn off. In addition, the baseband performs delay
correction  with  the  use  of  a  delay  locked  loop  which  is  part 
of  the  retiming  block. 

Finally,  in  State  3,  the  demodulation  phase,  each  pulse  of 
the  payload  is  filtered  by  the  matched  filter  derived  from  the 
channel  estimates  and  then  passed  through  a  decoder  that 
resolves  the  bit. 

A  block  diagram  of  the  baseband  is  shown  in  Figure  3. 
This  paper  exploits  two  forms  of  parallelism.  N  defines  the 
degree  of  parallelism  required  to  operate  the  digital 
baseband  in  sub-threshold.  M  is  defined  as  the  number  of 
Gold  Code  correlations  performed  simultaneously.  Each 
sub-bank,  composed  of  N  correlators,  checks  for  one  Gold 
Code  delay.  The  trade-offs  involved  in  the  specification  of 
M  and  N  will  be  discussed  in  the  following  sections.  Other 
papers have discussed the use of parallelism to reduce power
consumption for a baseband that uses both autocorrelation
and cross-correlation [3]; however, the metric here is to
reduce the energy consumption for a baseband that uses only
cross-correlation.


Fig. 3. UWB Parallelized Digital Baseband


4.  SUB-THRESHOLD  OPERATION  (IMPACT  OF  N) 

As previously mentioned, since the input from the ADC
arrives at a rate of 500MSPS, a serial baseband must run at a
frequency of 500MHz if the input is to be processed in real
time. In order that the critical paths, through the correlator
and through the matched filter, meet the timing constraint,
the digital circuitry must run at its maximum supply voltage.
However, running at the maximum voltage is not energy-efficient.
It is important to reduce the energy of the
correlator since it consumes the largest portion of energy in
the baseband during synchronization. The energy per
operation can be reduced by lowering the supply voltage
(Vdd) [4]. At maximum Vdd, the transistors in the circuit
operate in the active region. If Vdd is lowered below the
threshold voltage (Vth) of the device, the circuit is said to be
operating in the sub-threshold region. Lowering Vdd
increases the latency per operation linearly in the
active region, and exponentially in the sub-threshold region.
This increases the leakage energy, which is linearly related to
the latency per operation. There is a minimum operating
energy point since the dynamic energy and the leakage energy
scale in an opposite manner with Vdd [5]. Spectre simulations
of the correlator in the 90-nm process show that operating at
the minimum energy point of 0.3V rather than at the maximum
Vdd of 1V reduces the energy per operation of the correlator
by 89% (Figure 4).
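
The existence of a minimum energy point can be illustrated with a toy model. The sketch below is not the Spectre simulation referred to in the text: the current model, the threshold voltage, the slope parameter and the effective logic depth are all assumed values chosen only to show the shape of the trade-off, and the sweep starts at 0.1V on the assumption that the circuit does not function below that.

```python
import math

def on_current(v, vt=0.4, a=0.039):
    """Normalized drive current: exponential below vt, quadratic above.

    Hypothetical model; value and slope are continuous at v = vt.
    """
    if v <= vt:
        return math.exp((v - vt) / a)
    return (1.0 + (v - vt) / (2.0 * a)) ** 2

def energy_per_op(v, depth=770.0, vt=0.4, a=0.039):
    """Normalized energy per operation (1.0 = dynamic energy at 1V)."""
    dynamic = v * v                               # C * Vdd^2 with C = 1
    i_leak = math.exp(-vt / a)                    # off current with the gate at 0V
    latency = depth * v / on_current(v, vt, a)    # delay ~ depth * C * Vdd / I_on
    leakage = i_leak * v * latency                # leakage power x latency
    return dynamic + leakage

# Sweep the supply from 0.10V to 1.00V in 5mV steps.
voltages = [round(0.10 + 0.005 * k, 3) for k in range(181)]
v_min = min(voltages, key=energy_per_op)
e_min = energy_per_op(v_min)
print(v_min, e_min, energy_per_op(1.0))
```

With these assumed parameters the dynamic term falls quadratically as the supply is lowered, while the leakage term grows once the latency starts to increase exponentially, so the total has an interior minimum in the sub-threshold region.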

At the minimum energy point, the baseband processing
must be parallelized to maintain a throughput of 500MSPS.
For ease of design, it is desirable that the PRF of the
preamble be a multiple of the clock frequency of the
baseband. Since this is not possible at 0.3V, the baseband
operates slightly above the minimum energy point at 0.4V
with a frequency of 25MHz, which requires N=20 correlators
to form a sub-bank of correlators. Lowering the supply
voltage from 1V to 0.4V (sub-threshold) results in an overall
energy savings of 83% for the correlators and 68% for the
entire baseband. The energy savings for the entire baseband
is less since buffers are inserted in some paths to compensate
for the increased transition time at 0.4V.


Fig. 4. Simulated energy plot for the correlator

5. REDUCED ACQUISITION TIME (IMPACT OF M)


Increasing M increases the number of code shifts that are
simultaneously checked. Although this increases the power
consumed by the baseband during acquisition, the time spent
in acquisition decreases proportionally. In this section a
model is developed to show that the baseband energy
remains approximately the same for any M, while the energy
spent  by  the  rest  of  the  receiver  scales  inversely  with  M. 
This  results  in  an  overall  reduction  in  energy  per  packet. 

5.1.  Modeling  Energy  per  Packet. 


The  average  time  and  amount  of  energy  the  baseband 
spends  in  each  state  must  be  determined.  The  total  time  the 
baseband  spends  in  State  0  and  2  is  set  by  the  number  of 
times  the  Gold  code  is  repeated  in  the  preamble,  R(M), 
which  can  be  reduced  by  increasing  parallelism  M, 


R(M) =


(1) 


While the time spent in State 1 and State 3 is fixed, the
distribution of time between State 0 and 2 is dictated by
when the baseband detects lock. The maximum time the
baseband will remain in State 0 is R(M)×Tcode. In this case,
no time is spent in State 2.

Let D be the number of code shifts between the Gold
code in the preamble and the Gold code in the baseband.
Assume that D is uniformly distributed over [0, Nc-1]. Let
I(D,M) be the number of code durations (Tcode) required to
achieve acquisition in State 0. Let Pd be the probability that
the baseband detects lock when the input and the Gold code
are aligned, and Pfa,M be the probability that the baseband
detects lock in one or more of M delays which are not
aligned to the code. For small Pfa (=Pfa,1), Pfa,M ≈ M×Pfa.
Assuming the detector is ideal (Pd=1, Pfa,M=0),


I(D,M) = D/M, taken as a whole number of code durations.    (2)


The maximum number of code durations required
to achieve acquisition in State 0 is R(M). As the baseband
performs different operations during each state, the energy
per Tcode varies per state. E0 and E2 are the energies
consumed over Tcode in State 0 and State 2, while E1 and E3
are the energies required to perform channel estimation and
demodulation respectively. It is important to note that E0, to
a first order, scales linearly with M. After acquisition, M-1
of the correlator sub-banks can be turned off through the use
of clock gating and power gating so that E1, E2 and E3 are
not dependent on M to a first order.

The  energy  per  packet  is  computed  as  follows, 

$$ \mathrm{Energy}(D,M) = \sum_{X}^{R(M)} \Pr(X,D,M) \times \mathrm{Energy}(X,D,M) \qquad (3) $$

-OE^+0Ey  +  +£E3 


Pr(X,D,M) is the probability that the baseband will stay
in acquisition for X units of Tcode given a packet with delay
D. Energy(X,D,M) is the energy consumed by the baseband.
For all X≠I(D,M),

$$ \Pr(X,D,M) = P_{fa,M}\,(1-P_{fa,M})^{X-1} \qquad (4) $$

$$ \mathrm{Energy}(X,D,M) = X E_0 + E_1 \qquad (5) $$

When X=I(D,M),

$$ \Pr(X,D,M) \times \mathrm{Energy}(X,D,M) = P_d\,(1-P_{fa,M})^{X-1}\{X E_0 + E_1 + (R(M)-X) E_2 + E_3\} + (1-P_d)(1-P_{fa,M})^{X-1} R(M) E_0 \qquad (6) $$

This  paper  assumes  Pd=0.9  and  PfJ=l0  \  which  were  derived 
from the 802.15.3a proposal. The average energy required
by the baseband to process a packet for a given degree of
parallelism M is computed by taking the expected value of
the energy per packet over all possible delays D, conditioned
on M. If Pfa is small, this average baseband energy does not
change significantly since, to a first order, the same number
of operations occur for any M. In addition, for a small Pfa,
the required preamble time and hence energy spent during
acquisition by the rest of the receiver scale inversely with M.
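
The trend described in this section can be sketched numerically for the ideal-detector case (Pd=1, Pfa=0). The code below is an illustration under stated assumptions, not the model used in the paper: it assumes R(M) is the number of repetitions needed to test all Nc code shifts M at a time, that lock is found in the code duration in which the correct shift is tested, that the acquisition energy per code duration is M times that of one sub-bank, and that exactly one sub-bank stays on after lock. The fixed State 1 and State 3 energies are left out because they do not depend on M.

```python
import math

NC = 31            # Gold code length (from the text)
T_CODE_US = 1.24   # one code duration, Nc x Tpre, in microseconds

def repetitions(m):
    """Assumed form of R(M): repetitions needed to test all Nc shifts, M at a time."""
    return math.ceil(NC / m)

def durations_to_lock(d, m):
    """Assumed ideal-detector acquisition time, in code durations, for code shift d."""
    return d // m + 1

def avg_baseband_energy(m, e_bank=1.0):
    """Average State 0 plus State 2 energy, in units of one sub-bank running for Tcode."""
    total = 0.0
    for d in range(NC):
        x = durations_to_lock(d, m)
        e0 = m * e_bank   # all M sub-banks are active during acquisition
        e2 = e_bank       # assumption: one sub-bank keeps correlating after lock
        total += x * e0 + (repetitions(m) - x) * e2
    return total / NC

for m in (1, 2, 4, 8, 16, 31):
    preamble_us = repetitions(m) * T_CODE_US
    print(m, round(preamble_us, 2), round(avg_baseband_energy(m), 2))
```

Under these assumptions the preamble time falls roughly as 1/M, while the average baseband energy stays within a small factor across M. This is the qualitative behaviour the text relies on when it attributes the savings to the rest of the receiver being on for less time.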


5.2. Impact of M on Energy per Packet

The energy per packet can be broken down into the
preamble energy and the payload energy. The payload
energy is fixed by the number of bits transmitted per packet.
However, the length of the preamble, and consequently the
preamble energy, can be reduced based on the configuration
of the baseband. A previous version of the UWB baseband
checked one combination of the Gold code at a time [6]. In
order to check all shifted combinations of the 31-bit Gold
code, the baseband must perform at least 31 correlations.
The  preamble  must  last  for  NflxTprexR(M=l)  -  34.8B0ps. 

Fig. 5. Average packet energy consumption of the
receiver subsystems (digital baseband, ADCs, baseband amplifiers,
RF front end) for various degrees of parallelism (M).

By using multiple sub-banks of correlators that operate
in parallel, the number of Gold code shifts that can be
checked in one cycle is increased, which reduces the number
of repetitions required in the preamble. In a fully
parallelized baseband, with 31 sub-banks of correlators, all
31 shifted possibilities of the Gold code are checked
simultaneously, and the Gold code only has to be repeated
once in the preamble for acquisition. This results in a 31x
reduction in the preamble length. As previously stated, for
varying degrees of M, the energy spent by the baseband on
the acquisition is almost the same with a slight increase due
to increased interconnect capacitance that results from
parallelism. The actual energy savings result from the other
circuitry in the UWB receiver. Reduction in acquisition
time implies that the entire receiver needs to be on for a
much shorter period of time. The RF front end, ADCs and
the baseband amplifiers can be turned off once the packet
has been demodulated. The measured power of these blocks
is approximately 19% of the receiver power [7], [8]; shutting
them off earlier translates into significant energy savings.
Figure 5 shows the reduction in energy per packet, with
payload size of 500 bytes, for various degrees of parallelism.
It can be concluded that faster synchronization, combined
with duty-cycling, reduces the energy required to receive a
UWB packet. As increasing M only affects preamble energy,
the impact of using parallelism to reduce energy per packet
varies with payload size (Figure 6).

It is important to note that reduction in preamble energy
should not be made at the expense of the payload energy.
Techniques such as clock gating and power gating are used
to ensure that the power consumption of the baseband during
demodulation does not increase with parallelization. During
States 1, 2 and 3, either most or all correlators are turned off
and power gated to reduce leakage, and clock gating should
be used to reduce the impact of the increased interconnect
capacitance due to parallelism.


6.  CONCLUSION 

This  paper  discusses  how  parallelism  allows  for  voltage 
scaling and reduced acquisition time, which reduces the
energy required to receive a UWB packet. Voltage scaling
to sub-threshold allows the correlator sub-banks to operate
near the minimum energy point, resulting in an energy per
operation  reduction  in  the  correlators  of  83%  and  energy 
reduction  of  68%  across  the  entire  baseband.  The  reduced 
acquisition  time  through  further  parallelization  of  the 
correlator  sub-banks  by  31  led  to  a  43%  reduction  in  energy 
per  packet  for  a  500  byte  packet.  The  analysis  in  this  paper 
can  be  mapped  to  other  high  performance  communication 
applications using sub-threshold operation and parallelism.

Abstract

A 1.2V 6mW 500MS/s 5-bit ADC for use in a UWB receiver
has been fabricated in a pure digital 65nm CMOS technology.
The ADC uses a 6-channel time-interleaved successive
approximation register architecture. Each of the channels has a
split capacitor array to reduce switching energy and sensitivity
to digital timing skew. A variable delay line is used to optimize
the instant of latch strobing to reduce preamplifier currents.
Keywords: UWB, ADC, SAR, CMOS, time-interleaved

Introduction 

Ultra-wideband (UWB) radio is an emerging technology
that shows promise for very-high-data-rate wireless communication
over short distances. High speed (>500MS/s) and
low resolution (4-5b) ADCs are required to convert these
signals. It is desirable to integrate the ADC directly with
the high-performance UWB digital baseband processor in a
deep sub-micron CMOS process for best digital performance.
Low-power time-interleaved successive approximation register
(SAR) ADCs have been demonstrated at the speeds necessary
for UWB radio [1], [2]. The SAR topology is well suited for
implementation in deep sub-micron CMOS due to its very low
analog complexity.

This paper presents a 500MS/s 5-bit ADC in pure-digital
65nm CMOS. The ADC has 6 time-interleaved channels
synchronized to a common clock; each channel uses six clock
periods to perform a conversion (one for sampling followed by
five bit-cycles); thus the channels sample sequentially every
clock period. The ADCs have been designed to take advantage
of the process technology without sacrificing robustness in
the presence of increased variability. Two new techniques are
incorporated to improve energy-efficiency. A split capacitor
array reduces switching energy and is robust to digital delay
mismatches. In the comparator, a variable delay line and on-chip
delay detector optimize the instant of strobing for the
regenerative latch to lengthen settling times for preamplifiers.
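
The conversion flow described above (one sampling period followed by five bit-cycles, with six channels taking turns) can be sketched behaviourally. This is an idealized model for illustration only: the function names are hypothetical, the 0.4V full-scale value is taken from the input range quoted later in the text, and no circuit non-idealities are modelled.

```python
def sar_convert(vin, full_scale=0.4, bits=5):
    """Ideal successive approximation: one comparison per bit-cycle, MSB first."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)                      # tentatively set this bit
        if vin >= trial * full_scale / (1 << bits):    # compare against the DAC level
            code = trial                               # keep the bit
    return code

def interleaved_adc(samples, channels=6):
    """Assign sample n to channel n % channels; return (channel, code) pairs."""
    return [(n % channels, sar_convert(v)) for n, v in enumerate(samples)]

result = interleaved_adc([0.0, 0.21, 0.399, 0.1, 0.3, 0.05, 0.2])
print(result)
```

Each channel is busy for six clock periods per conversion, so six channels sampling on consecutive clock edges deliver one result per clock period.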

Technology  Considerations 

Deep sub-micron CMOS provides both opportunities and
challenges for mixed-signal design. The SAR architecture
can benefit greatly from reduced feature sizes because it
has significant digital but little analog complexity. The two
principal analog blocks in a SAR converter are the capacitor
array DAC and the comparator. The former benefits directly
from the reduced gate length and lower on resistance of
the switches. Sampling at the lower power supply (1.2V) is
achieved by constraining the input voltage to the 0-0.4V range;
thus a standard VT NMOS samples the input. The comparators
use a two stage preamplifier and a regenerative latch. Each
preamplifier, seen in Fig. 2(a), uses non-minimum length input
transistors to improve both matching and output impedance.
While this increases the device capacitance for the same gm,
the presence of wiring parasitics reduces the overall impact.

Split  Capacitor  Array 

The DAC is the first implementation of the split capacitor
array, wherein the MSB capacitor is split into an identical
copy of the rest of the array, theoretically analyzed in [3]. This
array is predicted to have 37% lower switching energy than
the conventional array and 1-step switching method without
any increase in total capacitance or area. Besides the energy
savings presented in [3], the split capacitor array is also
well suited for high-speed implementations. In a conventional
array, when two capacitors are required to transition on a given
bit-cycle, variation in digital propagation delays can cause
the array output to initially transition in the wrong direction,
producing a large overdrive condition for the preamplifiers,
increasing their settling times. In the split capacitor array,
only one capacitor switches during any bit-cycle, providing
inherent robustness against these digital timing skews, as
shown in Fig. 1. Under the worst-case timing skew, the settling
time is reduced by 10%.

Optimized  Latch  Strobing 

During bit-cycling, the clock period is divided into one
phase for the settling of the DAC and preamplifiers and one
phase for regeneration of the latch. The latch typically resolves
in much less than 1ns even for very small inputs. The
ADC sits idle after the latch settles until the start of the next
bit-cycle. Self-timed bit-cycling has been proposed to use this
idle time to start the next bit-cycle early [4]. This approach
relaxes the preamplifier settling time requirement for all but
the first bit-cycle (determining the MSB), as it has no prior
bit-cycle from which to borrow. Here, a variable delay line
has been inserted in series with the latch strobe signal (Fig.
2(a)) to extend analog settling time in the first half of every
bit-cycle, including the first, "pre-borrowing" time from that
bit-cycle's own latch phase. The slower speed requirements
allow reduced preamplifier currents. To tune this delay for
various clock frequencies and operating conditions, an on-chip
delay detector has been designed, shown in Fig. 2(b). The
latch's inputs are shorted to produce the worst case settling
behavior and its outputs are captured both by a replica of
the SAR digital path and the Done signal in
R4. Any difference between these outputs is an indication
of the failure of the latch to resolve fast enough to meet the
setup time constraints of R1-R2, and thus the delay should be
reduced. An off-chip loop is used to determine the frequency
of errors and tune the delay via a configuration register. This
function could be implemented on-chip with a counter and a
simple finite state machine.

Fig. 1. 5-bit split capacitor array and simulated settling behavior under the
presence of digital timing mismatch
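
A minimal sketch of such a tuning loop is given below. It is hypothetical throughout: the text does not describe the register encoding or the loop algorithm, so the code assumes that a larger configuration code means a longer strobe delay, that errors only become more frequent as the delay grows, and that `count_errors` stands in for the counter reading obtained from the delay detector.

```python
def tune_delay(count_errors, max_code, max_errors=0):
    """Return the largest delay code whose error count stays within max_errors.

    count_errors(code) is a caller-supplied (hypothetical) measurement of how
    often the delay detector flagged a mismatch at that delay setting.
    """
    best = 0
    for code in range(max_code + 1):
        if count_errors(code) > max_errors:
            break          # latch no longer resolves in time: stop increasing delay
        best = code
    return best

# Example with a made-up detector response: errors appear above code 9.
chosen = tune_delay(lambda code: 0 if code <= 9 else 5, max_code=15)
print(chosen)
```

The loop stops at the first failing setting and keeps the last passing one, which matches the rule in the text that a detected difference means the delay should be reduced. If even the smallest setting fails, the sketch returns code 0.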


Measured  Results 

The ADC has been fabricated in a pure-digital 65nm
CMOS technology with a nominal supply voltage of 1.2V.
At 500MS/s, the analog and digital supplies, excluding I/O
power, consume 2.86mW and 3.06mW, respectively. Using
a separate on-chip test channel with the conventional array
and switching method, the measured DAC energy savings for
the split capacitor array is 31%, which closely matches the
theoretical model; increased bottom-plate routing accounts for
the  difference.  The  static  linearity  is  -0.16/Q.lo  INL  and 
—0.25/0.26  DNL  (Fig.  3);  the  split  capacitor  array  shows 
no  linearity  degradation  versus  the  conventional  array.  The 
dynamic  results  are  presented  in  Fig.  4.  The  SNDR  does  not 
drop by 3dB until past the Nyquist frequency. An FFT of
a 239.04MHz input sampled at 500MS/s is shown in Fig.
5. The level of offset voltage mismatch and timing skew is
sufficiently low for proper reception of UWB signals. Using
the figure of merit in [1], P/(2^ENOB × 2fin), at the Nyquist
frequency, the ADC achieves 755fJ/conv. step. At 250MS/s,
420fJ/conv. step is achieved by lowering the voltage supplies.
The die photograph is shown in Fig. 6.
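
The figure of merit can be evaluated with a few lines of code. The sketch below reads the metric as P/(2^ENOB × 2·fin) and the reported result as 755 fJ per conversion step; the function names are illustrative, and the effective number of bits is inferred from those values rather than quoted from the measurements.

```python
import math

def fom_j_per_step(power_w, enob, fin_hz):
    """Energy per conversion step: P / (2^ENOB * 2 * fin)."""
    return power_w / (2.0 ** enob * 2.0 * fin_hz)

def enob_from_fom(power_w, fom_j, fin_hz):
    """Invert the figure of merit to recover the implied ENOB."""
    return math.log2(power_w / (fom_j * 2.0 * fin_hz))

power = 2.86e-3 + 3.06e-3   # analog + digital supply power at 500MS/s, in W
fin = 250e6                 # Nyquist input frequency for 500MS/s, in Hz
implied_enob = enob_from_fom(power, 755e-15, fin)
print(implied_enob)
```

The same function can be used to compare operating points, since lowering the supplies reduces P while the input frequency term scales with the sampling rate.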

Acknowledgments  The  authors  would  like  to  thank  Texas 
Instruments for fabricating the chip. This work is funded by
NSF,  DARPA,  and  an  NDSEG  Fellowship. 


Fig. 2. Variable delay line to extend preamplifier settling times in (a) the
comparator circuit and (b) the latch-delay-detect circuit.

Fig. 3. INL and DNL versus output code

Fig. 4. SNDR and SFDR versus input frequency.

Fig. 5. FFT of 239.04MHz sine wave sampled at 500MS/s with dominant
spurs labeled as arising from either timing skew or offset mismatch.

Fig. 6. Photograph of 1.9 x 1.2mm die

ABSTRACT

Sub-threshold operation is a compelling approach for energy-constrained
applications, but increased sensitivity to variation
must be mitigated. We explore variability metrics and
the variation sensitivity of stacked device topologies. We
show that upsizing is necessary to achieve robustness at reduced
voltages and propose a design methodology to meet
yield constraints. The need for upsizing imposes an energy
overhead, influencing the optimal supply voltage to minimize
energy. Finally, we characterize performance variability
by summing delay distributions of each stage in an arbitrary
critical path and achieve results accurate to within
10% of Monte Carlo simulation.

Categories  and  Subject  Descriptors:  B.8.1  [Reliability, 
Testing,  and  Fault-Tolerance] 

General  Terms:  Performance,  Design,  Reliability 

Keywords: Sub-threshold circuits, Minimum energy point,
Delay model

1. INTRODUCTION

In sub-threshold circuits, the power supply is set below
the transistor threshold voltage VT to obtain energy savings
when speed is not the primary constraint [1]. Authors of
[2] [3] derived analytical expressions for the optimum VDD to
minimize energy in sub-threshold and showed its dependence
on major circuit parameters. Sub-threshold circuits rely on
leakage currents that are exponentially dependent on VT
and are therefore more sensitive to process variation than
traditional above-threshold designs.

It was suggested in [4] that minimum size devices are
theoretically optimal for minimizing energy in sub-threshold.
However, minimum size devices have increased sensitivity to
VT variation because σVT is roughly proportional to 1/√(WL).
If a minimum size circuit does not function at the optimum
VDD due to degraded logic output swing, it is necessary to
upsize devices to improve robustness at the expense of
increased energy consumption. Therefore, variability must be
considered when analyzing the minimum energy operating
point.

Previous work in [5] addresses intra-die variation by providing
statistical models for energy and delay of an inverter
chain in sub-threshold. An empirical expression for the optimum
voltage is shown as a function of logic depth, assuming
complete functionality at Vmin. Work in [6] presents
a unified delay variability expression for strong- and weak-inversion
and applies it to a NAND gate. Researchers have
also proposed various approaches to optimize delay yield by
tuning VDD/VT or choosing gates of different drive strengths,
for example in [7]. However, functional yield was not considered
until [8] [9], which address unsatisfactory VOH and VOL
in sub-threshold inverters whose output levels are degraded
by leaking devices, such as in a register file. Body biasing is
another option for mitigating variation in sub-threshold [10]
when a triple-well process is available.

We address inter- and intra-die variation and show that
functionality in sub-threshold circuits may be compromised
without proper design for variations. We first explore variability
metrics for the inverter and logic gates with stacked
devices, and propose a metric to size logic gates for a fixed
failure rate under process variation. We then examine the
energy versus VDD profile given the failure rate constraint
and find the optimum sizing and supply voltage. We present
an efficient methodology to model delay variability of a chain
of logic gates and characterize the effect of yield-based sizing
constraints on performance variability.


2.  VARIABILITY  METRICS  AND  DEVICE 
SIZING 

A  commonly  used  expression  for  sub-threshold  current  is 
given by [11]

where n is the sub-threshold swing factor, Vth the thermal
voltage, and η the DIBL coefficient. The nominal current
scales linearly with W/L, while the standard deviation of the VT
distribution reduces with (WL)^(-1/2), thus lowering sub-threshold
current variation. This section explores how sizing affects
variability in output swing and active current in the inverter
and stacked device topologies.
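
The effect of sizing on current variation can be illustrated with a small Monte Carlo sketch. The current expression used here is a common textbook form of the sub-threshold equation and is an assumption, since it may differ in detail from the expression cited from [11]; the prefactor, swing factor, DIBL coefficient, nominal VT and the minimum-size σVT are likewise made-up illustrative values.

```python
import math
import random
import statistics

V_TH = 0.026  # thermal voltage near room temperature, in volts

def subthreshold_current(vgs, vds, vt, w_over_l=1.0, i0=1e-7, n=1.5, eta=0.1):
    """Assumed textbook form: I0*(W/L)*exp((Vgs-Vt+eta*Vds)/(n*Vth))*(1-exp(-Vds/Vth))."""
    return (i0 * w_over_l
            * math.exp((vgs - vt + eta * vds) / (n * V_TH))
            * (1.0 - math.exp(-vds / V_TH)))

def log_current_spread(area_scale, trials=2000, sigma_min=0.03, seed=1):
    """Std. dev. of ln(I) when sigma_VT shrinks as 1/sqrt(W*L)."""
    rng = random.Random(seed)
    sigma_vt = sigma_min / math.sqrt(area_scale)
    logs = [math.log(subthreshold_current(0.3, 0.3, 0.4 + rng.gauss(0.0, sigma_vt)))
            for _ in range(trials)]
    return statistics.pstdev(logs)

ratio = log_current_spread(4.0) / log_current_spread(1.0)
print(ratio)
```

Because the current depends exponentially on VT, the spread of ln(I) is proportional to σVT, so quadrupling the gate area halves the spread in this model.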

2.1  Logic  Gate  Output  Swing 

In the sub-threshold regime, the ratio of active to idle currents
in a logic gate is much lower than in strong inversion.
If, for example, process variation strengthens NMOS relative
to PMOS, a pull-up network will not be able to drive the
logic gate output fully to VDD because of idle leakage in the
pull-down network. This degradation in gate output swing
is illustrated in Figure 1(a). The solid line shows the voltage
transfer characteristic (VTC) of a minimum size inverter in
a 65nm technology at skewed global process corner. Dashed
lines plot the VTCs when random local VT mismatch is applied
to the inverter. One case shows a severely degraded
VOL, which can cause functional error if it is above the input
low threshold (VIL) of the succeeding gate. Therefore, VT
variation significantly impacts circuit functionality in deeply
scaled technologies.


Figure 1: (a) Inverter VTCs at skewed process corner
with random VT mismatch. (b) Butterfly plot
of NAND/NOR gates with functional output levels.
(c) Butterfly plot of NAND with failing VOL. (d) Example
circuit for verifying logic gate output levels.

A consistent metric is necessary to determine whether a
logic gate has sufficient VOL and VOH levels. Arbitrary limits,
such as 10% and 90% of VDD, do not scale well across
global process corners. For example, at the strong-PMOS
weak-NMOS corner, strong leakage through PMOS raises
VOL of all gates above ground. This also shifts VTCs to the
right, and thus logic gates can tolerate higher VOL in the
preceding gate. Instead of arbitrary limits, we propose using
butterfly plots to verify output voltage levels, specifically
in the context of standard cell design.

2.1.1 Use of the Butterfly Plot

To verify VOL of a given gate, we superimpose its VTC with the
mirrored VTC of NOR, since the latter has the most stringent VIL
requirement from stacked devices in the pull-up network and
parallel devices in the pull-down. Similarly, we verify VOH using
the NAND VTC, which has the worst case VIH.

In Figure 1(b), a NAND gate has sufficient output swing such that
VOL-NAND produces a logic high output in a succeeding NOR gate.
In contrast, the NAND gate in Figure 1(c) exhibits
VOL-NAND = 65mV and produces a NOR output of 136mV, close to
mid-rail and thus causing logic failure.

A gate with failing output levels is analogous to a 6T SRAM cell
displaying negative static noise margin (SNM), in that the
butterfly plots for both cases do not contain an inscribed
square. Therefore, we can also apply [12] to find the side of the
largest inscribed square, illustrated in Figure 1(b). Figure 1(d)
shows an equivalent circuit for this measurement on two
back-to-back logic gates. Because the VTC is input-dependent, all
inputs are varied simultaneously to obtain the worst case VIH and
VIL.

It was shown in [13] that the SNM of two back-to-back gates G1
and G2 is equal to the maximum noise that can be applied to all
gates in an infinitely long chain of alternating G1 and G2,
before logic failure occurs. Thus when verifying a standard cell
G using the butterfly plot, we essentially assume that all logic
paths in a synthesized circuit are composed of alternating G and
NAND3 gates with the same two skewed VTCs. To accurately model
the failure rate of a custom-designed logic path, we would plot
VTCs of all gates and trace the signal propagation through the
path. Exact modeling is not possible for standard cell design
where the target circuit is unknown. Therefore, although the
butterfly plot does not reflect the exact mismatch conditions in
a circuit, it does provide a guideline for sizing standard cells
consistently to account for local variation.

2.1.2 Failure Rate From Insufficient Output Swing

We now define logic failure as having no inscribed square in the
butterfly plot and measure how the failure rate varies with VDD
and device sizing. To consider logic gates with up to three
stacked devices, we verify the INV, NAND2, and NOR2 gates against
NAND3 and NOR3, which give the most stringent VIH and VIL
requirements respectively. The sizes of NAND3 and NOR3 are fixed
to provide a starting point for designing the remaining gates.

The failure rate is estimated from a 5k-point Monte Carlo
simulation at worst case temperature. VT of transistors in the
gate under test and global (inter-die) process conditions are
randomized such that the Monte Carlo runs are analogous to
sampling logic gates across multiple dies. Figure 2(a) shows the
failure rate versus VDD of an inverter at various widths
normalized to minimum size. Simulated values in markers are
fitted to an exponential function a·e^(bx), drawn as a solid
line. Note that the failure rate decays more quickly when
W = 1.66 compared to W = 1. Furthermore, zero samples failed in
the 5k-point run at higher voltages, as indicated by arrows on
the graph.


Figure 2: Failure rate of (a) inverter and (b) static register
vs. VDD, plotted for various NMOS and PMOS widths (normalized to
minimum size).

Figure 3: Output swing failure rate of the inverter, NAND2, and
NOR2, plotted against device width (normalized to minimum size).
VDD is set at 240mV for demonstration.

Figure 3 plots the failure rate versus normalized device width of
INV, NAND2, and NOR2. In the inverter, both device sizes are
varied simultaneously. In NAND2 and NOR2, the critical
two-transistor stack is changed while the two parallel devices
are kept constant. The failure rates also decay exponentially
with widths. By increasing the device width or VDD, the failure
rate can be made to approach 0.

2.2 Noise Margin in Registers

The concept of noise margin is also relevant in sub-threshold
register design, where data retention is a particular challenge.
Dynamic registers suffer from charge leakage, which worsens in
sub-threshold due to slow circuit speeds. Therefore, we consider
the static transmission-gate based register. Similar to SRAM
cells, the data retention capability of the register is reflected
in the hold static noise margin of its cross-coupled inverters.
Figure 4 shows the equivalent circuit for measuring the register
SNM, accounting for the voltage drop across T2 and the worst case
leakage across T1. This circuit is used in a Monte Carlo
simulation while varying the VT of each transistor and inter-die
process conditions. Figure 2(b) plots the resulting failure rate
in the cross-coupled inverters. Similar to the case of logic
gates, the failure rate decreases exponentially to zero when
either width or VDD is increased.

Figure 4: Static register schematic and equivalent circuit for
measuring SNM.


2.3  Current  Variability 

In addition to output swing, active current variability is
another metric of interest since it relates directly to variation
in propagation delay. With the common assumption that VT is
normally distributed, sub-threshold current can be modeled as a
lognormal random variable. From the property of lognormal
distributions, the coefficient of variation of active current is
given by

\frac{\sigma_I}{\mu_I} = \sqrt{\exp\left(\frac{\sigma_{V_T}^2}{(n V_{th})^2}\right) - 1} \qquad (3)

where Vth = kT/q is the thermal voltage.

It was observed in [5] that as VDD reduces, the sub-threshold
swing factor n decreases. This leads to higher uncertainty in the
sub-threshold current through a single device. To examine the
impact of topology, Figure 5 plots simulated σ/μ of active
current versus device width for static CMOS primitives consisting
of one to three devices in series. Variability decreases with
larger widths as expected. Stacked device topologies clearly
display lower spread in active currents.

Figure 5: (a) Monte Carlo setup for current variability
measurement. (b) Active current variability of different CMOS
primitives vs. device width (normalized to minimum size) at
VDD = 300mV.
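
The lognormal property used in this section can be checked numerically. The sketch below assumes the usual exponential dependence of sub-threshold current on VT, so that a normally distributed VT with standard deviation sigma_vt gives a lognormal current with shape parameter sigma_vt / (n * Vth). The numeric values (sigma_vt = 20mV, n = 1.5, Vth = 25.9mV) are illustrative assumptions, not values taken from the simulations reported here.

```python
import math
import random

def lognormal_cv(sigma_vt, n, v_th=0.0259):
    """Closed-form sigma/mu of I = I0 * exp(-VT / (n * v_th)), VT ~ Normal."""
    s = sigma_vt / (n * v_th)
    return math.sqrt(math.exp(s * s) - 1.0)

def monte_carlo_cv(sigma_vt, n, v_th=0.0259, samples=200_000, seed=1):
    """Sample VT offsets and estimate sigma/mu of the resulting current."""
    rng = random.Random(seed)
    currents = [math.exp(-rng.gauss(0.0, sigma_vt) / (n * v_th))
                for _ in range(samples)]
    mean = sum(currents) / samples
    var = sum((i - mean) ** 2 for i in currents) / (samples - 1)
    return math.sqrt(var) / mean

# Illustrative (assumed) parameters: 20 mV of VT spread, swing factor 1.5.
print("closed form:", lognormal_cv(0.020, 1.5))
print("monte carlo:", monte_carlo_cv(0.020, 1.5))
```

With these assumptions a smaller swing factor n gives a larger coefficient of variation for the same VT spread, which matches the trend described above for reduced VDD.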


2.4  Constant  Yield  Device  Sizing 

We now address the issue of device sizing for single and stacked
device topologies, given the metrics of output swing and current
variability. In above-threshold design, series devices are sized
to give equivalent resistance as the inverter. However, in
sub-threshold design when the objective is to minimize energy,
device sizes should be kept as small as possible while satisfying
variability constraints.

Compared to a single device, stacked devices display lower
current spread but higher uncertainty in output levels, which may
lead to functional errors. Reducing the error rate clearly takes
precedence, so output swing rather than current variability
should be considered first in sizing decisions.

The output swing failure rate versus width plot of Figure 3
illustrates a sizing methodology for single and stacked devices.
Suppose we constrain all topologies to have the same failure
rate, or interchangeably, a constant yield. We obtain the
required device sizes by drawing a horizontal line at the desired
failure rate, then finding where this line intersects the failure
curve and the corresponding x-axis value. In Figure 3, a target
failure rate of 0.13% requires a single and 2-stack NMOS to be
sized at 2 and 4.43 times minimum width respectively. 1-PMOS is
sized the same as 1-NMOS as both devices are varied together in
simulation. The 2-stack sizing here can be used for any static
CMOS gate with two series NMOS, since it was derived from NAND2
where two leaking parallel PMOS give the worst case VOL.

Because the failure rate reduces at higher VDD, the required size
for a given yield constraint also decreases. The resulting energy
trade-off will be analyzed in Section 3.1. Table 1 lists device
widths for a constant failure rate of 0.13% while VDD is varied
at 20mV intervals. 0.13% represents the 3σ tail of a normal
distribution and is chosen for demonstration. It should be noted
that such a target allows sizing logic gates consistently, but
does not relate in a straightforward way to the failure rate of a
circuit built from these gates. As mentioned previously, this
value is a pessimistic estimate because it assumes that every
second gate in the circuit is NAND3 or NOR3. Furthermore, failing
logic gates tend to cluster on die at process corners.
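
The constant yield procedure can be sketched in a few lines: fit the exponential a·exp(b·W) to simulated failure rates, then solve for the width at which the fitted curve crosses the target. The data points below are hypothetical and are generated from an assumed curve only to exercise the code; they are not the simulated values behind Figure 3 or Table 1.

```python
import math

def fit_exponential(widths, failure_rates):
    """Least-squares fit of ln(rate) = ln(a) + b * width; returns (a, b).

    Points with zero observed failures are skipped because ln(0) is
    undefined (they correspond to the arrows in the failure-rate plots).
    """
    pts = [(w, math.log(r)) for w, r in zip(widths, failure_rates) if r > 0]
    n = len(pts)
    mean_w = sum(w for w, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((w - mean_w) ** 2 for w, _ in pts)
    sxy = sum((w - mean_w) * (y - mean_y) for w, y in pts)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_w)
    return a, b

def width_for_target(a, b, target):
    """Width at which a * exp(b * width) equals the target failure rate."""
    return math.log(target / a) / b

def three_sigma_tail():
    """One-sided probability beyond 3 sigma of a normal distribution."""
    return 0.5 * math.erfc(3.0 / math.sqrt(2.0))

# Hypothetical failure-rate samples from an assumed curve (a=0.05, b=-1.8).
widths = [1.0, 1.5, 2.0, 2.5]
rates = [0.05 * math.exp(-1.8 * w) for w in widths]
a, b = fit_exponential(widths, rates)
print("required width:", width_for_target(a, b, three_sigma_tail()))
```

Repeating the fit at each supply voltage would produce one required width per VDD, which is the shape of data a constant yield sizing table holds.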

Table 1: Required widths (normalized to minimum size) vs. VDD for
constant failure rate = 0.13%.

3. MINIMUM ENERGY OPERATION

The total energy per operation consumed by an arbitrary circuit
is modeled in [2] as

E_T = E_{dyn} + E_L = C_{eff} V_{DD}^2 + W_{eff} I_{leak} V_{DD} t_d L_{DP} \qquad (4)

Edyn and EL model the dynamic switching and leakage energy per
cycle respectively. Ceff and Weff denote the average total
switched capacitance and normalized width contributing to leakage
current, td and Ileak represent the delay and leakage current of
a characteristic inverter, while LDP is the logic depth in terms
of the inverter delay. As VDD decreases, Edyn is lowered
quadratically. The leakage current reduces because of DIBL, but
td goes up exponentially at sub-threshold voltages and causes a
similar increase in leakage energy. The two opposing trends give
rise to an optimal supply voltage VDDopt at which total energy is
minimized, assuming the circuit is functional.
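
To make the trade-off concrete, the following sketch evaluates the energy model numerically and locates the minimum on a 1mV grid. It assumes a standard exponential sub-threshold current model for both delay and leakage, with a linear DIBL term; every numeric parameter (I0, VT, n, the DIBL coefficient, Ceff, Weff, LDP) is an illustrative assumption and not a fitted value from the 65nm process.

```python
import math

V_TH = 0.0259  # thermal voltage kT/q near room temperature (volts)

def inverter_delay(vdd, k=1.0, c_g=1e-15, i0=1e-6, v_t=0.4, n=1.5):
    """Assumed delay model: K*Cg*VDD / (I0 * exp((VDD - VT) / (n*V_TH)))."""
    return k * c_g * vdd / (i0 * math.exp((vdd - v_t) / (n * V_TH)))

def leakage_current(vdd, i0=1e-6, v_t=0.4, n=1.5, dibl=0.1):
    """Assumed off current at VGS = 0; DIBL lowers VT linearly with VDD."""
    return i0 * math.exp(-(v_t - dibl * vdd) / (n * V_TH))

def total_energy(vdd, c_eff, w_eff, l_dp):
    """Dynamic plus leakage energy per operation, as in Equation 4."""
    e_dyn = c_eff * vdd ** 2
    e_leak = w_eff * leakage_current(vdd) * vdd * inverter_delay(vdd) * l_dp
    return e_dyn + e_leak

def optimum_vdd(c_eff, w_eff, l_dp, lo_mv=100, hi_mv=600):
    """Grid search (1 mV steps) for the supply that minimizes total energy."""
    candidates = [mv / 1000.0 for mv in range(lo_mv, hi_mv + 1)]
    return min(candidates, key=lambda v: total_energy(v, c_eff, w_eff, l_dp))

# Hypothetical circuit: 50 fF switched, 200 leaking widths, logic depth 30.
v_opt = optimum_vdd(c_eff=50e-15, w_eff=200.0, l_dp=30.0)
print("VDDopt (V):", v_opt)
```

The search deliberately starts at 100mV: the simple model ignores functional failure, so it is only meaningful over the range where the circuit still works.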

Section 2 has shown that functionality is no longer guaranteed at
low supply voltages when VT variation is significant. Reducing
the probability of logic failure requires either upsizing devices
or increasing VDD, which must be considered when finding VDDopt.
This can be accounted for within the framework of [2] by treating
Ceff and Weff as a function of VDD. The resulting energy versus
VDD characteristics of an inverter chain and a 32-bit Kogge-Stone
adder are simulated in a 65nm process and presented as examples.

3.1 Minimum Energy Point with Yield Constraint

Figure 6 plots Ceff and Weff versus VDD for the Kogge-Stone adder
under two sizing schemes. The solid line plots energy of designs
satisfying an upper bound on the output swing failure rate,
derived from constant yield sizing of Table 1. The dashed line
indicates an adder with only minimum size devices. Note that Weff
is obtained by normalizing the adder leakage current to that of a
characteristic inverter [2]. DIBL affects leakage through the two
circuits differently as VDD decreases, causing a slight increase
in Weff in this case. VDDcrit denotes the critical operating
voltage at which minimum size devices can be used to satisfy the
yield constraint. When VDD > VDDcrit, the circuits under both
schemes are identical.

It should be noted that once the yield constraint is set, VDDcrit
can be found immediately from Table 1 and the topology of a given
circuit. For example, a circuit without stacked devices does not
require upsizing when VDD > VDDcrit = 300mV. In contrast, a
circuit with stacks of two NMOS has VDDcrit = 340mV.


Figure 6: (a) Ceff and (b) Weff for adder with constant yield
(CY) and minimum sizing (MS).

The switching, leakage, and total energy of the inverter chain
and adder are then calculated according to Equation 4. Figure
7(a) plots the energy versus VDD characteristic of the inverter
chain at nominal process and temperature. Total energy in both
constant yield and minimum sized chains is dominated by the
dynamic component. Therefore, the optimum supply voltage of the
minimum size chain (dashed line) is the lowest VDD at which yield
constraints are met. By definition, this is equal to VDDcrit. In
the constant yield sizing scheme (solid line), reducing the
supply below VDDcrit necessitates an increase in device widths.
The resulting rise in Ceff dominates total energy. In this
situation, there is no benefit from upsizing in order to operate
at lower VDD. The optimum operating point is with minimum sizing
at the lowest VDD permitted by the failure rate constraint.

When the minimum size circuit does have a local minimum in its
energy characteristic, three scenarios exist depending on the
relationship between VDDcrit and the optimum VDD of the constant
yield (VDDopt-CY) and minimum sizing (VDDopt-MS) schemes.

Case 1) VDDopt-MS > VDDcrit. No upsizing is required to operate
at the minimum energy point, therefore a minimum sized circuit at
VDDopt-MS yields optimum energy.

Case 2) VDDopt-MS < VDDopt-CY < VDDcrit. A minimum size circuit
cannot operate at VDDopt-MS without violating failure rate
constraints. A circuit suitably upsized to operate at VDDopt-CY
yields optimum energy while satisfying yield requirements.

Case 3) VDDopt-MS < VDDopt-CY = VDDcrit. At VDDcrit, the circuits
under both sizing schemes are identical. Therefore a minimum size
circuit operating at VDDcrit provides minimum energy.
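
The three scenarios amount to a small decision rule, sketched below. The function name and return format are hypothetical, and voltages are passed in millivolts as integers to keep comparisons exact. Equality in Case 1 is folded into that branch, since a minimum size circuit at VDDcrit is the answer either way.

```python
def select_operating_point(vdd_opt_ms, vdd_opt_cy, vdd_crit):
    """Return (case, sizing, vdd) for the three scenarios (voltages in mV).

    vdd_opt_ms: energy-optimal supply of the minimum size circuit
    vdd_opt_cy: energy-optimal supply under constant yield sizing
    vdd_crit:   lowest supply at which minimum size devices meet the yield
    """
    if vdd_opt_ms >= vdd_crit:
        # Case 1: the unconstrained optimum already satisfies the yield.
        return 1, "minimum", vdd_opt_ms
    if vdd_opt_cy < vdd_crit:
        # Case 2: upsizing pays off below the critical voltage.
        return 2, "constant-yield", vdd_opt_cy
    # Case 3: upsizing never helps; stay at the critical voltage.
    return 3, "minimum", vdd_crit

# Supply values quoted in this section for the 32-bit adder.
print(select_operating_point(vdd_opt_ms=280, vdd_opt_cy=300, vdd_crit=340))
```

The same rule can be applied per block once VDDcrit has been read off the sizing table for that block's deepest device stack.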

An example of case 2 is seen in Figure 7(b) for a synthesized
32-bit Kogge-Stone adder with interconnect parasitics extracted
from layout. Ignoring failure rate constraints, the minimum size
adder (dashed line) has an optimum supply voltage of
VDDopt-MS = 280mV. When we account for failure rate constraints,
the effect of constant yield sizing (solid line) is to add energy
overhead when VDD < VDDcrit. This shifts the local minimum to the
right, hence VDDopt-CY > VDDopt-MS. Here VDDopt-CY is also
< VDDcrit, therefore the adder with constant yield sizing at
VDDopt-CY = 300mV consumes 10.1% less energy than a minimum size
adder at VDDcrit = 340mV. In this example, constant yield sizing
results in a small reduction in energy due to the shallow minimum
of the energy versus VDD curve.


Figure 7: Energy vs. VDD of (a) 11-stage inverter chain and
(b) 32-bit adder. Solid and dashed lines indicate CY and MS
sizing respectively.


4. PERFORMANCE VARIABILITY

4.1  Delay  Variability  Modeling 

Circuits in sub-threshold display significantly higher delay
variability than in above-threshold, therefore proper modeling is
essential for timing verification. This section presents a
methodology to efficiently model the delay distribution of a
chain of logic gates. Using this model, we characterize the delay
variability of the Kogge-Stone adders of Section 3.1.

From [2], the delay of a sub-threshold logic gate can be modeled
as

t_d = \frac{K C_g V_{DD}}{I_0 \exp\left(\frac{V_{DD} - V_T}{n V_{th}}\right)}

where K is a delay fitting parameter, Cg is the output
capacitance, and the denominator models the gate active current
(Vth denotes the thermal voltage kT/q). Both the active current
and td are lognormally distributed with the same σ parameter.
Therefore, delay variability is also given by Equation 3. It
depends on σVT, which decreases as (WL)^(-1/2), and the
sub-threshold swing n, which decreases with VDS. To the first
order, σ/μ does not depend on input slew or load capacitance.

The critical path delay in sub-threshold is a sum of lognormal
random variables (RVs), typically approximated as another
lognormal RV. Authors of [5] derived an expression for the
propagation delay of a chain of identical inverters using the
Wilkinson approximation. Here we employ the Schwartz-Yeh method
[14] to model the sum of non-identically distributed lognormal
RVs. The delay of an arbitrary critical path can then be obtained
by summing the pre-characterized distributions of each logic gate
in the path.

The Schwartz-Yeh method is an iterative algorithm for calculating
the sum of lognormal RVs that requires much less computation time
than Monte Carlo simulation. The modeling methodology using this
algorithm is described as follows:


1) Characterize mean delay and standard deviation (μ_gate,
σ_gate) of each logic gate in a cell library, under one input
slew and output load condition.

2) Simulate the (N-stage) critical path of interest at nominal
process corner and without VT variation. The delay of the jth
stage in the critical path gives μ_j-path, for j = 1 to N.

3) For each gate j in the critical path, let σ_j-path =
μ_j-path · σ_j-gate / μ_j-gate, where σ_j-gate and μ_j-gate are
characterized in 1). Since the delay variability σ/μ is
approximately constant across input slew and load conditions,
this scales the pre-characterized standard deviation of each gate
to the input slew and load conditions in the actual critical
path.

4) μ_j-path and σ_j-path characterize the distribution of each
stage, and are input to the Schwartz-Yeh algorithm to generate
the delay distribution of the entire critical path.
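
Steps 1 to 3 can be sketched directly. For step 4, the example below does not implement the Schwartz-Yeh recursion; as a simpler stand-in it uses moment matching (the Fenton-Wilkinson style approximation), which fits a single lognormal to the exact mean and variance of the sum. It also assumes stage delays are statistically independent. The library entries and path delays are hypothetical numbers chosen for illustration.

```python
import math

def scale_sigma(mu_path, mu_gate, sigma_gate):
    """Step 3: keep sigma/mu constant when moving a gate into the path."""
    return mu_path * sigma_gate / mu_gate

def path_delay_distribution(path_means, gate_names, library):
    """Approximate the path delay as one lognormal by moment matching.

    path_means: nominal delay of each stage in the path (step 2)
    gate_names: library cell used at each stage
    library:    {cell: (mu_gate, sigma_gate)} from step 1
    Returns (mean, sigma, ln_mu, ln_sigma) of the summed delay.
    """
    stages = []
    for mu_path, name in zip(path_means, gate_names):
        mu_gate, sigma_gate = library[name]
        stages.append((mu_path, scale_sigma(mu_path, mu_gate, sigma_gate)))
    mean = sum(mu for mu, _ in stages)
    var = sum(sigma ** 2 for _, sigma in stages)  # independence assumed
    ln_var = math.log(1.0 + var / mean ** 2)
    ln_mu = math.log(mean) - 0.5 * ln_var
    return mean, math.sqrt(var), ln_mu, math.sqrt(ln_var)

# Hypothetical pre-characterized library and a three-stage path.
library = {"INV": (1.0, 0.20), "NAND2": (1.4, 0.25), "NOR2": (1.8, 0.30)}
result = path_delay_distribution([1.5, 2.0, 2.5], ["INV", "NAND2", "NOR2"], library)
print("mean, sigma, ln_mu, ln_sigma:", result)
```

Moment matching preserves the mean and variance of the sum exactly under the independence assumption but can misplace the tails, which is one reason an iterative method such as Schwartz-Yeh may be preferred for timing sign-off.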

The above methodology is applied to a three-stage chain
consisting of INV-NAND-NOR and to the critical path of a 32-bit
Kogge-Stone adder at 300mV. Table 2 compares statistical model
results with a 1k-point Monte Carlo simulation randomizing VT of
all transistors. The model estimates the mean and standard
deviation of the path delay to within a few percent of the Monte
Carlo results. This shows that keeping σ/μ constant provides a
good approximation.


Table 2: Delay distribution parameters from statistical model and
Monte Carlo simulation at 300mV. Values are normalized to FO4
delay.

|                           |   | Model | Monte Carlo | % Difference |
| INV-NAND-NOR Chain        | μ | 4.957 | 4.692       | [illegible]  |
|                           | σ | 1.561 | [illegible] | 4.51%        |
| Kogge-Stone Critical Path | μ | 36.52 | 37.13       | 1.65%        |
|                           | σ | 7.038 | 7.262       | -3.09%       |

This method is used to characterize the delay distribution of
1) the 32-bit adder with constant yield sizing at
VDDopt-CY = 300mV, and 2) the adder with minimum size devices at
VDDcrit = 340mV. Table 3 shows that the first adder exhibits
larger mean and 3σ delay, since VDDopt-CY < VDDcrit. However, the
delay variability of both adders is comparable, indicating that
upsized devices in the first adder offset increased variability
from operating at a lower supply voltage.

Table 3: Delay distribution comparison of two adders from Section
3.1. Values are normalized to FO4 inverter delay at VDDopt-CY.

|        | Const. Yield Sizing | Min. Sizing |
| μ      | 90.88               | 44.92       |
| σ      | 17.46               | 8.857       |
| μ + 3σ | 143.3               | 71.49       |
| σ/μ    | 0.1921              | 0.1972      |

4.2  Energy  Variability 

From a 1k-point Monte Carlo simulation, we characterize the
energy distribution of the adder with constant yield sizing at
VDDopt-CY and the other with minimum size devices at VDDcrit. As
suggested in [5], the switched capacitance is verified to vary
negligibly with VT mismatch and is treated as deterministic.
Figure 8(a) shows that even though the former adder employs
larger devices, it displays lower mean leakage current due to
DIBL, and lower variability as an additional benefit. The first
adder exhibits lower mean total energy but higher variability in
Figure 8(b). The latter effect results from the delay term in
leakage energy having larger mean and standard deviation at 300mV
compared to 340mV. Note that the leakage component is a product
of two dependent lognormal RVs, so it is not strictly lognormally
distributed.

5.  CONCLUSION 

In this paper, we have examined the effect of variation and
sizing on single and stacked device topologies in sub-threshold
circuits. Compared to a single device, stacked devices exhibit
lower current variability but a higher probability of logic
failure from insufficient output swing. We introduced the use of
butterfly plots to verify logic gates as well as registers
against process variation, and showed that upsizing is necessary
to mitigate degraded output levels. The need for upsizing to meet
a given yield constraint imposes an energy overhead and impacts
the optimum sizing and supply voltage at which energy is
minimized. We presented a methodology to model delay variation in
an arbitrary critical path using the delay distribution of each
stage. Finally, we compared the delay and energy variability of
the proposed sizing scheme with a minimum size circuit, and
showed that energy reduction is possible without compromising
yield or performance variability.

Figure 8: (a) Leakage current and (b) total energy for two adders
of Section 3.1, normalized to those of characteristic inverter at
VDDopt-CY.

ABSTRACT

In this paper, we identify the key challenges that oppose
sub-threshold circuit design and describe fabricated chips that
verify techniques for overcoming the challenges.

Categories and Subject Descriptors

B.7.1 [Integrated Circuits]: Types and Design Styles

General Terms

Performance, Design, Reliability

Keywords

Sub-threshold digital circuits, low voltage memory, dynamic
voltage scaling, process variations, sub-threshold logic

1. INTRODUCTION

Sub-threshold operation for digital circuits was first shown as
the means of minimizing CMOS VDD in 1972 [1]. Analog
sub-threshold circuits subsequently received a lot of attention
for low-power applications (e.g. [2][3]). Interest in digital
sub-threshold was revived in the late 1990s [4], and a multiplier
was demonstrated operating in sub-VT at 0.475V that used body
bias to balance p/n currents [5]. A sub-VT ring oscillator also
employed body biasing and functioned at 80mV [6].

The primary motivation for using sub-VT circuits is to reduce
energy. Analysis of energy contours in [7] demonstrated that
minimum energy operation occurs in the sub-threshold region. Once
VDD < VT, delay increases exponentially with additional voltage
scaling. Leakage current integrates over the longer delay until
leakage energy per operation exceeds the active energy and causes
the minimum point. Models capture this effect and illustrate the
impact of various parameters in [8][9].

The potential for minimizing energy at the cost of speed
degradation defines the set of applications for which
sub-threshold circuits are well-suited. First, energy-constrained
applications such as wireless sensor nodes, RFID tags, or
implants are dominated by the need to minimize energy
consumption. Speed is a secondary consideration for this class of
applications, so sub-VT circuits offer a good solution. Secondly,
many burst-mode applications require high performance for brief
time periods between extended sections of low performance
operation. Sub-threshold circuits can minimize energy for
computations executed during the low performance slots. Finally,
the parallelism inherent in many signal processing and
communications circuits can be exploited to scale voltages into
sub-VT, providing a low energy solution for throughput-centric
applications (e.g. [10]).

This paper describes the key challenges that confront
sub-threshold circuit designers and presents chips that overcome
the challenges.

2. Sub-Threshold Logic: FFT Processor

Static CMOS gates continue to function in sub-VT, but some
challenges make logic design more difficult. First, CMOS
processes are designed with strong-inversion operation in mind,
so the ratio of drive current in sub-VT is frequently imbalanced
relative to the case where pMOS and nMOS are symmetrical. The
shaded region in Figure 1 shows the operational range for a ring
oscillator in 0.18µm CMOS at the worst-case corners. VDD is
minimized when the p/n sizing ratio is 12, which indicates that
the process is imbalanced such that p/n current is 1/12 relative
to the symmetric case. This unfriendly sort of technology
imbalance can aggravate process variations and even require
different circuit designs for different imbalance scenarios. In
addition, the low VDD results in a reduced Ion/Ioff ratio that
can reduce robustness, especially for circuits with parallel
leakage paths [11].


Figure 1: Minimum achievable voltage for 10%-90% output swing for
0.18µm ring oscillator at worst case process corners
(simulation).

A 0.18µm CMOS FFT processor uses circuits that account for these
challenges: static CMOS logic is used for robustness, gates with
parallel leakage paths are redesigned, large stacks are avoided
to improve Ion/Ioff, and a register-file memory uses logic-based
structures. The chip is fully functional for 128, 256, 512, and
1024 FFT lengths (8-bit and 16-bit precision) at VDD from 180mV
to 900mV [11]. Figure 2 shows the measured energy consumption for
8-bit and 16-bit processing as a function of voltage. 8-bit
processing has a lower activity factor and thus has lower
switching energy. However, because the leakage energy is the same
for both 8-bit and 16-bit processing, the minimum energy point
increases to 400mV from 350mV. At the 16-bit optimum, the chip
runs at 10 kHz and consumes 155nJ/FFT, which is 350X more energy
efficient than a typical low-power microprocessor and 8X more
energy efficient than a standard ASIC implementation [11].


Figure 2: Measured energy per 8- and 16-bit FFT vs. VDD.


3. Scaling Performance: Ultra-DVS

Burst mode applications cannot exclusively utilize sub-threshold
operation because they require periodic high speed functionality.
Traditional dynamic voltage scaling (DVS) could be extended to
include sub-threshold operation, but the overhead of providing
the necessary voltages can be large. Adjustable DC-DC converters
tend to have limited efficiency over broad voltage ranges, and
they take 100s of microseconds to switch. An alternative
implementation method called local voltage dithering (LVD) offers
a reduced overhead means for implementing ultra-DVS (UDVS) down
to the sub-threshold region. LVD uses power switches to select
from among two or more VDD supplies at the local block level
[12]. Figure 3 shows an example system that has 3 VDDs. As the
required rate (normalized frequency) for processing incoming data
changes, each block spends a different fraction of its operating
time at different voltage levels. The averaging effect of this
dithering produces an energy consumption profile that nears the
optimal (e.g. infinite voltage levels) profile.
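
The averaging effect can be sketched as a time-weighted mix of the two supply levels that bracket the required rate. The function below is a simplified illustration: it ignores switching overhead and idle leakage, assumes a block that needs less than the lowest rate simply runs at the lowest level and then idles, and uses hypothetical normalized (rate, energy per operation) pairs rather than measurements from the test chip.

```python
def dithered_energy_per_op(rate, levels):
    """Average energy per operation when dithering between adjacent levels.

    levels: iterable of (max_rate, energy_per_op) pairs, one per supply.
    rate:   required throughput, in the same units as max_rate.
    """
    levels = sorted(levels)
    if rate <= levels[0][0]:
        # Assumption: run at the lowest level and idle for the remainder.
        return levels[0][1]
    for (r_lo, e_lo), (r_hi, e_hi) in zip(levels, levels[1:]):
        if r_lo <= rate <= r_hi:
            # Fraction of time spent at the higher supply.
            alpha = (rate - r_lo) / (r_hi - r_lo)
            ops_hi = alpha * r_hi            # operations done at the high level
            ops_lo = (1.0 - alpha) * r_lo    # operations done at the low level
            return (ops_hi * e_hi + ops_lo * e_lo) / rate
    raise ValueError("rate exceeds the fastest supply level")

# Hypothetical normalized (rate, energy/op) for three supplies.
levels = [(0.1, 1.0), (0.5, 4.0), (1.0, 9.0)]
for r in (0.05, 0.3, 0.75, 1.0):
    print(r, dithered_energy_per_op(r, levels))
```

Because the mix is a weighted average of the two bracketing levels, the resulting profile is piecewise smooth between supplies and never exceeds the energy of running everything at the higher level.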


Figure 3: Example UDVS system using LVD and three VDDs.

A 90nm CMOS test chip uses LVD to implement UDVS for 32-bit
Kogge-Stone adders [12]. Measurements from the chip show that
high rate (e.g. >0.1) dithering can occur in 1 cycle due to the
local granularity of the headers. Figure 4 shows an example
energy profile for a UDVS system using energy measurements from
the test chip. For high rates, the blocks dither between the top
two supplies (1.1V and 0.8V in the figure) to achieve
near-optimal energy consumption. When performance requirements
relax for low rate operation, the blocks can hop to the VDD that
gives minimum energy operation (330mV for the 90nm adder block)
to achieve 9X savings in energy consumption.


Figure 4: Energy profile based on 90nm chip measurements for
example 3-VDD system.

4.  Sub-Threshold  SRAM 

SRAM is an important component of many ICs, and it can contribute
a large fraction of the active and leakage power consumption. It
is important to have sub-VT compatible SRAMs for sub-VT systems.
However, the nature of SRAM circuits makes them a melting pot of
all of the major sub-VT challenges.

Random  variation  fundamentally  affects  the  geometry  and 
threshold  voltage  of  CMOS  devices  and  is  increasingly  prominent 
in  scaled  technologies.  The  large  array  nature  of  SRAM  implies 
that  extreme  tails  of  the  distributions  limit  yield.  The  problem  is 
exacerbated  in  sub-VT,  where  device  strength  depends 
exponentially  on  threshold  voltage,  and,  in  the  presence  of 
variation,  relative  strengths  cannot  be  guaranteed  by  sizing.  As  a 
result,  the  widely  used  6T  SRAM  cell,  which  relies  on  ratioed 
operation and is used to maintain density, fails to operate in
sub-VT. Figures 5a,b show the read/hold and write static noise margins
[13], respectively, for a typical 6T cell and for the 3σ case. At
reduced voltages, read margin is negative and write margin is
positive, indicating failure for both operations.


Figure 5: Simulated SNM for (a) read/hold and (b) write.


The  increased  impact  of  variation  on  device  strength  in  sub-VT 
also  has  a  limiting  effect  on  SRAM  performance  and  integration. 
SRAM cell read current, IRD, decreases exponentially in sub-VT,
but the speed is ultimately set by the weakest cell in the array.
Figure 6a plots IRD for cells on the weak side of the distribution
normalized to the mean (i.e., IRD/µ(IRD)). The limiting effect of
cell strength variation is amplified in sub-VT where cells can be
over an order of magnitude weaker than the mean.


Figure 6: Effect of cell variation on (a) worst case read
current and (b) bit-line leakage.

Parallel  leakage  also  limits  voltage  scaling  for  SRAM.  In 
conventional 6T SRAM, a stored “1” is read dynamically from a
precharged bit-line. However, the reduced ION/IOFF ratio in sub-VT
is lowered even more due to the unaccessed cells sharing the
bit-line, which results in a degraded logic level. Sub-VT bit-line
leakage  is  less  problematic  at  high  voltages  where  the  discharge 
time  of  an  accessed  cell  is  much  faster  than  that  of  the  aggregate 
unaccessed  cells.  However,  where  variation  extends  the  required 
discharge  time,  bit-line  leakage  severely  limits  the  number  of 
cells  that  can  be  integrated  onto  a  column.  Figure  6b  shows  the 
leakage  current  of  127  unaccessed  cells  normalized  to  the  drive 
current  of  a  single  accessed  cell  weakened  by  variation.  Values 
greater  than  unity,  which  occur  in  sub-VT,  imply  that  drive  current 
is  indistinguishable  from  leakage,  making  reliable  read  accesses 
impossible. 
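
The read-failure condition above can be written as a simple ratio check. This is a minimal sketch under assumed numbers: the per-cell currents and the weakening factor are hypothetical placeholders, not values from Figure 6b.

```python
def leakage_to_drive_ratio(n_unaccessed, i_leak_cell, i_drive_mean, weakening):
    """Aggregate leakage of the unaccessed cells on a bit-line divided by
    the drive current of one accessed cell weakened by variation.

    weakening is how many times weaker than the mean the accessed cell is."""
    i_drive_worst = i_drive_mean / weakening
    return n_unaccessed * i_leak_cell / i_drive_worst


def read_is_distinguishable(ratio):
    """A ratio of 1 or more means drive current cannot be told from leakage."""
    return ratio < 1.0


# Hypothetical currents in arbitrary units: mean drive is 1000x one cell's
# leakage, and the accessed cell is 10x weaker than the mean.
ratio = leakage_to_drive_ratio(127, 1.0, 1000.0, 10.0)
print(ratio, read_is_distinguishable(ratio))
```

In this assumed case the 127 unaccessed cells together leak more than the weak accessed cell can drive, so the read cannot be resolved reliably.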

Numerous  techniques  have  been  reported  to  mitigate  the  low- 
voltage  SRAM  problems  described  above.  For  instance,  reduced 
bit-line  precharge  voltages  and  negative  word-line  bias  for 
unaccessed  cells  have  been  used  to  increase  the  read  SNM. 
Similarly,  increased  word-line  bias  and  negative  bit-line  voltages 
have  been  used  to  improve  the  write  SNM.  While  these 
approaches  can  improve  the  situation  for  sub-VT  SRAM, 
approaches  that  address  the  problems  more  fundamentally  provide 
a better solution for robust operation in sub-threshold.

A  65nm  test  chip  implements  a  256kb  memory  that  overcomes 
the problems and provides functionality in the sub-threshold
region to below 400mV [14]. The SRAM uses a 10T bit-cell,
shown in Figure 7. M7-M10 form a read buffer that isolates the
internal storage nodes, Q and QB, so that a read upset is not
possible. This eliminates the read SNM problem of Figure 5a, and
stability is instead limited by the hold SNM. Measurements from
the test chip show that the cell can hold data correctly below
250mV. Write operations in Figure 5b fail since the access
devices in a 6T bit-cell are too weak to over-power the internal
cell feedback, which is made worse by process imbalance that
makes pMOS sub-threshold current higher than nMOS by an
order of magnitude. Robust write in the new 10T cell is performed
by weakening the feedback structure by floating VVDD. Finally,
bit-line leakage on RBL is minimized by unconditionally raising
the voltage of QBB for unaccessed cells. This relies on either the
active pull-up current through M9, or the ratio of its leakage
current to that of M10. In either case, M8's VGS becomes
negative,  resulting  in  vanishingly  small  sub-threshold  leakage 
current  to  the  bit-line.  This  structure  allows  256  bit-cells  to  be 
integrated  per  column. 


Figure 7: Schematic of 10T sub-threshold bit-cell [14].


5.  Conclusions 

Numerous  problems  increase  the  challenge  of  designing  robust 
sub-threshold circuits. Some time-tested design practices, such as
ratioed write in SRAM, become unreliable due to the exponential
dependence of sub-threshold drive current on parameters with
large  process  variations.  We  have  presented  an  overview  of  the 
types  of  circuits  and  architectures  that  overcome  these  problems 
and  produce  working  designs.  Functional  implementations  of  a 
sub-threshold  FFT  processor  [11],  an  energy-scalable  UDVS  test 
chip [12], and a sub-threshold SRAM [14] attest that robust
sub-threshold systems can practically offer minimum energy
operation. 

Minimizing the energy consumption of battery-powered systems
is a key focus in integrated circuit design. Switching energy of
digital circuits reduces quadratically as VDD is decreased below VT
(i.e., sub-threshold operation), while the leakage energy increases
exponentially. These opposing trends result in a minimum energy
point (MEP), defined as the operating voltage at which the
total energy consumed per operation (Eop) is minimized [1].
Operating circuits at their MEP [1, 2] has been proposed as a
solution for energy critical applications and the analytical
solution of the MEP has been derived in [3]. The MEP can vary
widely for a given circuit depending on its workload and
environmental conditions (e.g., temperature). By tracking the MEP as it
varies, energy savings of 50 - 100% are demonstrated and even
greater savings can be achieved in circuits dominated by leakage.
In this paper, a 65nm CMOS circuit that can dynamically track
the MEP of a digital circuit with varying operating conditions is
presented. Embedded within the tracking loop is an ultra-low-power
switching DC-DC converter that can efficiently deliver
supply voltages down to 250mV, enabling minimum energy
operation.

Figure  3.2.1  shows  the  architecture  of  the  MEP  tracking  loop, 
which adjusts the output, VDD, to the minimum energy operating
voltage of the digital circuit (FIR filter). An energy sensor circuit,
together with an energy minimization algorithm, is used to set
the reference voltage of the DC-DC converter. The DC-DC converter
maintains VDD close to the reference voltage. The key element
in the loop is the energy sensor circuit, which computes the
Eop of the digital circuit at a given reference voltage. The DC-DC
converter is disabled during energy sensing. Assuming that the
voltage across the storage capacitor Cload falls from the reference
voltage V1 to V2 in the course of N operations of the digital circuit,
Eop at the voltage V1 is equal to $C_{load} (V_1^2 - V_2^2)/(2N)$. To
measure Eop accurately, V2 should be close in value (within
20 - 30mV) to V1. Methods to measure Eop by digitizing V1 and V2
using conventional ADCs, or by sensing the inductor current,
dissipate a significant amount of overhead power. Our proposed
energy efficient approach to obtain Eop is to observe that, by
design, V1 is very close to V2. Thus, $V_1^2 - V_2^2$ can be simplified
to $2V_1 (V_1 - V_2)$ within an acceptable error. Since the digital
representation of V1, which is the reference voltage to the DC-DC
converter, is already known, only the digital value for V1 - V2 is
required to estimate Eop.
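
The size of the error introduced by this simplification can be checked with a short calculation. The sketch below assumes the capacitor-energy expression for Eop given above; the capacitance, voltages, and operation count are hypothetical values chosen only to show a droop in the 20 - 30mV range.

```python
def eop_exact(c_load, v1, v2, n_ops):
    """Energy per operation from the capacitor energy drop:
    C_load * (V1^2 - V2^2) / (2N)."""
    return 0.5 * c_load * (v1 ** 2 - v2 ** 2) / n_ops


def eop_approx(c_load, v1, v2, n_ops):
    """Simplified estimate using V1^2 - V2^2 ~= 2 * V1 * (V1 - V2)."""
    return c_load * v1 * (v1 - v2) / n_ops


# Hypothetical values: 1 uF storage capacitor, 25 mV droop over 64 operations.
exact = eop_exact(1e-6, 0.370, 0.345, 64)
approx = eop_approx(1e-6, 0.370, 0.345, 64)
rel_error = (approx - exact) / exact    # equals (V1 - V2) / (V1 + V2)
print(exact, approx, rel_error)
```

For these assumed numbers the simplified estimate is high by roughly 3.5%, and the error shrinks as V2 approaches V1.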

Figure 3.2.2 shows the voltage difference measuring circuitry.
Before starting an N operation energy sense cycle, the voltage
across Cload is sampled on C1 and the DC-DC converter is
disabled. The digital circuit runs for N operations using the energy
stored in Cload, and the voltage across Cload droops to some
value V2 (< V1), which is then sampled across C2. Subsequently,
the DC-DC converter is enabled and normal operation of the digital
circuit continues. At this point, a current sink (M1, M2)
connected across C1 turns ON and a fixed frequency clock drives
a counter. The fixed frequency clock, together with the constant
current sink that drains C1, quantizes voltage into time steps, as
in an integrating ADC. The number of fixed frequency clock
cycles required for V1 to droop down to V2 is directly proportional
to V1 - V2. Once the value of V1 - V2 is obtained digitally, it is
multiplied with V1 to get an estimate of Eop.
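
The voltage-to-time quantization can be sketched as follows. This is an idealized model that assumes a perfectly constant sink current; the capacitor size, sink current, and clock frequency are hypothetical and are not taken from the chip.

```python
import math


def droop_to_count(v1, v2, c1, i_sink, f_clk):
    """Clock cycles counted while a constant current sink discharges
    the sampling capacitor C1 from V1 down to V2."""
    t_discharge = c1 * (v1 - v2) / i_sink    # seconds, from I = C * dV/dt
    return math.floor(t_discharge * f_clk)   # the counter sees whole cycles only


# Hypothetical component values: 1 pF sampling capacitor, 1 nA sink, 1 MHz clock.
count_30mv = droop_to_count(0.370, 0.340, 1e-12, 1e-9, 1e6)
count_15mv = droop_to_count(0.370, 0.355, 1e-12, 1e-9, 1e6)
print(count_30mv, count_15mv)
```

Halving the droop roughly halves the count, which is the proportionality the energy estimate relies on; the resolution is set by the ratio of sink current to capacitance and by the clock period.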

The digital representation of Eop is then used by a slope descent
algorithm to arrive at the MEP. Based on the value of Eop
obtained, the algorithm suitably changes the reference voltage to
the DC-DC converter. Once the converter settles at this new voltage,
the energy sensing operation is performed again and the
cycle repeats until the minimum is achieved. At this point the
loop shuts down. Figure 3.2.3 shows a measured waveform of the
tracking loop in operation. The MEP tracking loop can be enabled
by a system controller as needed depending on the application.
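
A slope descent search of this kind might look like the sketch below. It is an illustration under stated assumptions, not the on-chip algorithm: the fixed 50mV step, the initial upward direction, the function name `track_mep`, and the quadratic stand-in for the energy sensor are all hypothetical.

```python
def track_mep(measure_eop, v_start_mv, step_mv, v_min_mv, v_max_mv):
    """Step the reference voltage until the sensed energy per operation
    stops decreasing. Returns (settled_voltage_mv, visited_voltages_mv)."""
    visited = [v_start_mv]
    best_v, best_e = v_start_mv, measure_eop(v_start_mv)
    direction = +1          # assumed: the first trial step is upward
    moved = False           # True once a step has lowered the energy
    tried_reverse = False   # True once the direction has been flipped
    while True:
        v_next = best_v + direction * step_mv
        improved = False
        if v_min_mv <= v_next <= v_max_mv:
            visited.append(v_next)
            e_next = measure_eop(v_next)
            if e_next < best_e:
                best_v, best_e = v_next, e_next
                moved = True
                improved = True
        if improved:
            continue
        if moved or tried_reverse:
            break           # the minimum has been passed (or bracketed)
        direction = -direction
        tried_reverse = True
    if visited[-1] != best_v:
        visited.append(best_v)   # return to the best voltage found
    return best_v, visited


# Stand-in energy sensor with its minimum at 370 mV (hypothetical curve).
settled, path = track_mep(lambda v: (v - 370) ** 2 + 100, 420, 50, 250, 700)
print(settled, path)
```

With this stand-in curve the search visits 420, 470, 370 and 320mV before settling at 370mV, the same order of voltages as in the measured waveform of Figure 3.2.3.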

Figure  3.2.4  shows  the  DC-DC  converter  embedded  within  the 
minimum  energy  tracking  loop.  The  converter  is  a  synchronous 
rectifier buck converter [4] with off-chip filter elements. It is
designed to deliver load voltages from VDD = 250mV to as high as
VDD = 700mV at ultra-low load power levels from 1µW to 100µW.
This precludes the usage of high gain amplifiers for zero voltage
and current switching. The converter implemented uses an open
loop control for zero current switching. Depending on the load
voltage being delivered, an appropriate delay is multiplexed in,
turning the NMOS off when the inductor current approaches zero
(see Fig. 3.2.4). A Pulse Frequency Modulation (PFM) control
scheme is used to improve efficiency as the load power levels are
low. The clock for the reference voltage comparator is derived
from the critical path replica ring oscillator which feeds the digital
circuit. This allows the comparator clock to scale automatically
with VDD and hence the load power, eliminating unnecessary
comparisons. The simplicity of open-loop PFM mode control helps
in decreasing the power consumption of the control circuitry,
thereby improving the low load efficiency. The converter efficiency,
plotted from measured results in Fig. 3.2.5, is >80% while
delivering load powers of 1µW and higher and 86% at 100µW
(VDD = 0.5V).

Figure 3.2.6 shows how the MEP varies with workload for a 7-tap
FIR filter implemented in 65nm CMOS. Workload is changed by
varying the number of taps of the FIR filter. The MEP decreases
with increasing workload because the ratio of the active energy to
total energy per operation increases. It can be deduced from
curves 1, 2 in Fig. 3.2.6 that 110% energy is saved by moving VDD
to the new MEP value instead of staying at the original MEP
value of 320mV. The MEP increases with temperature as the
ratio of leakage energy to total energy per operation increases.
Energy savings of the order of 50% are achieved as the MEP is
tracked when the temperature changes from 0 to 85°C. The energy
savings obtained are highly circuit dependent and can be
much larger in modern digital ICs, which dissipate a significant
portion of power in leakage.

The energy overhead associated with obtaining the MEP is
equivalent to the energy consumed by 50 operations at the MEP in the
minimum workload scenario (WL1). The proposed minimum
energy tracking loop is non-intrusive, thereby allowing the load
circuit to operate without being shut down. The tracking
methodology is independent of the size and type of digital circuit being
driven and the topology of the DC-DC converter.

Figure 3.2.7 shows the micrograph of the test chip fabricated in a
65nm CMOS process. The active area of the chip, which includes
the digital test circuitry, occupies 0.88mm² with the minimum
energy tracking circuitry occupying 0.05mm². The small area and
energy overhead of the tracking loop facilitate the use of multiple
such loops for each distinct voltage domain in a complex digital
system.

Figure 3.2.1: Block diagram of the minimum energy tracking loop and
embedded DC-DC converter.

Figure 3.2.2: Circuitry to compute energy/operation (Eop) at a given
operating voltage.

Figure 3.2.3: Measured waveform showing the minimum energy tracking
loop in operation. VDD starts at 420mV and is then increased to 470mV.
The loop then changes direction and reduces VDD to 370mV and 320mV
before settling at the MEP of 370mV.

Figure 3.2.4: Pulse Frequency Modulation control of the DC-DC converter
showing open loop NMOS pulse width determining circuitry. The time
delays are chosen to turn the NMOS off as the inductor current
approaches zero.

Figure 3.2.5: Measured efficiency plot of the DC-DC converter.

Figure 3.2.6: Measured Eop curves with change in workload for a 7-tap
FIR filter. Curve 1 has an intentional 1µA leakage current added to the
maximum workload scenario (curve 2). 'X' denotes the measured voltage
at which the minimum energy loop settles.

Figure 3.2.7: Micrograph of the test chip in 65nm CMOS. EMB is the
energy minimizing block which comprises the energy sensor circuitry
and the energy minimization algorithm.

The subthreshold regime is a critical biasing space as it enables
minimum energy operation for logic circuits [1]. However, practical
systems rely heavily on SRAMs, which conventionally limit
the minimum VDD to above Vt. SRAMs often dominate the total
die area and power, and minimizing their energy requires scaling
VDD as low as possible. In this work, a 256kb SRAM in 65nm
CMOS is presented that operates in sub-Vt (at 350mV) despite
the exponential effect Vt variations have on device strength.

The 6T bit-cell in Fig. 18.4.1 provides a good balance between
stability, performance, and density. However, in the presence of
variation, it fails to operate in sub-Vt. Figure 18.4.1 shows a Monte
Carlo simulation of the SNM [2] for both read and hold cases of a
65nm cell. At 350mV, hold stability is preserved, but read failures
are prominent. Write SNM violations (not shown) appear in a
similar manner. Functional errors are also caused by severely
degraded Iread. Figure 18.4.1 considers the case of 256 cells per
column. In sub-Vt, the values stored in the unaccessed cells can
result in an aggregate leakage current on the shared bitlines that
is greater than the 3σ and 4σ read currents, implying that the
data in the accessed cell is indistinguishable from bitline leakage.

To overcome these challenges, the 8T bit-cell shown in Fig. 18.4.2
is developed. Buffered read eliminates the read SNM limitation;
peripheral footer circuitry eliminates bitline leakage; peripheral
write drivers and storage-cell supply drivers interact to reduce
the cell supply voltage during write operations; and sense-amp
redundancy provides a favorable trade-off between offset and
area. Previous implementations of sub-Vt memories deal with
stability, read-current, and bitline leakage by adding devices
within the cell or employing hierarchy to limit fan-in/out. For
instance, a 10T cell operates at 400mV [3], and a register-file uses
multiplexed read to operate at 310mV [4]. In this design, peripheral
circuit assists are used to maximize density and reduce the
leakage paths to those of a 6T SRAM.

In  Fig.  18.4.2,  the  read  buffer  is  composed  of  M7-M8.  Instead  of 
statically connecting its foot to ground, however, a foot-driver is
used in the periphery. As shown in Fig. 18.4.3, the buffer-foots of
all cells of the same word are shorted, and their foot-driver is
shared. During a read, only the foot of the accessed word is driven
low; all others remain at VDD. Accordingly, after RDBL is
precharged, the read-buffers of the unaccessed cells have no voltage
drop across them, and their access devices have a negative
VGS. Consequently, they impose no sub-Vt leakage, and dynamically
held data values of “1” on RDBL can be sensed successfully.

The foot-driver is required to sink the read current from all of the
accessed cells. Use of a large NMOS to accomplish this is impractical
since it would impose a significant area and leakage-power
overhead. Instead, the sub charge-pump circuit shown in Fig.
18.4.3 is used. The voltage boost provided by typical charge-pump
implementations suffers from Vt drops, and would be inadequate
for this application. Instead, the circuit of Fig. 18.4.3 uses a
PMOS (M1) to precharge CBOOST. The charge-pump generates a
swing of nearly 2VDD at the input of the foot-driver, enhancing its
current by over two orders of magnitude while reducing variation
dependencies on its devices. This allows the devices of the
foot-driver to be near minimum sized so that their leakage-power
is insignificant. Further, since the charge-pump drives minimal
load, its devices and boost capacitor can be small, consuming
negligible power and area.

Write operations fail when the cell pass devices cannot overpower
the internal cell feedback. In this design, write (Fig. 18.4.4) is
performed by boosting WL by 50mV and, more importantly,
reducing VVDD through a supply driver. Simultaneously, new data
is written primarily by pulling the desired storage node low
through the NMOS pass device. Although the opposite storage
node is only weakly pulled high, its load PMOS provides a current
path to VVDD. Accordingly, all cells in the accessed word
contribute to driving VVDD high through one of their NMOS pass
devices. Relatively large devices are used in the supply driver,
and the net variation in the pass devices and write drivers tends
to average; hence, sizing accurately allows VVDD to be set to a low
intermediate voltage.

The write mechanism, which is essential for sub-Vt operation,
requires each word to have a separate VVDD. As shown in Fig.
18.4.5, this implies that columns of different blocks cannot be
interleaved in layout, and adjacent columns can no longer share
a multiplexed sense-amp. Hence, the number of sense-amps
required increases, and each must fit in a column pitch.
Nominally, the approach of large-signal read, which is advantageous
in high-density scaled SRAMs [5], is used; nonetheless, the
BL voltage levels are degraded, due to gate-leakage and other
noise mechanisms, and sense-amp offsets still limit yield. To remedy
this, sense-amp redundancy is employed. Erroneous reads
occur when the net offset of each sensing network is greater than
the input voltage swing. Increasing device sizes reduces local
variation, accordingly reducing sense-amp offset. Redundancy,
however, allows exclusive selection of the sense-amp that minimizes
the achievable offset. Hence, errors now depend on the
joint probability that all sense-amps have an offset greater than
the input voltage swing. As shown in Fig. 18.4.5, the error probability
for a half-sized sense-amp is greater than that for a unit-sized
sense-amp; however, Monte Carlo simulation shows that
the joint error probability for two half-sized sense-amps is lower
than that for a unit-sized sense-amp. Specifically, a factor of five
improvement is observed at the input swings of interest (i.e.,
50mV). This only applies where the errors due to offset are
uncorrelated, so a pseudo-differential sense-amp structure is
employed to cancel the effects of global variation.
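
The joint-probability argument can be illustrated with a simple analytical model. This sketch is not the Monte Carlo simulation behind Fig. 18.4.5: it assumes independent, zero-mean Gaussian offsets whose standard deviation scales as one over the square root of device area, and the unit-sized sigma used in the example is hypothetical. It shows the direction of the effect only and is not expected to reproduce the factor of five reported above.

```python
import math


def p_error(swing, sigma):
    """P(|offset| > swing) for a zero-mean Gaussian offset."""
    return math.erfc(swing / (sigma * math.sqrt(2.0)))


def compare_redundancy(swing, sigma_unit, n_redundant=2):
    """Error probability of one unit-sized sense-amp versus the joint error
    probability of n_redundant sense-amps sharing the same total area."""
    # Assumed area scaling: 1/n of the area gives sqrt(n) times the sigma.
    sigma_small = sigma_unit * math.sqrt(n_redundant)
    p_unit = p_error(swing, sigma_unit)
    p_small = p_error(swing, sigma_small)
    p_joint = p_small ** n_redundant    # every redundant sense-amp must fail
    return p_unit, p_small, p_joint


# Hypothetical offsets: 50 mV input swing, unit-sized sigma of one third of it.
p_unit, p_half, p_joint = compare_redundancy(0.050, 0.050 / 3.0)
print(p_unit, p_half, p_joint)
```

Under these assumptions a single half-sized sense-amp is worse than the unit-sized one, while the pair is better, which matches the qualitative trend described in the text.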

Increased redundancy yields further improvement, but the overhead
of selecting between redundant sense-amps and storing that
selection state also increases. In this design, two sense-amps are
used, requiring the minimal support circuitry of two flip-flops
and a few logic gates. On start-up, a selection routine determines
which sense-amp can correctly read both logic “0” and “1”, and
enables only the corresponding structure.

The SRAM is fabricated in a 65nm CMOS process (Fig. 18.4.7).
The 256kb array is arranged into eight 256-row × 128-column blocks.
Full read and write functionality is achieved with a VDD of 350mV
(and 50mV boosting of WL drivers). At this voltage, the SRAM
operates at 25kHz and consumes 2.83µW during read and
3.96µW during write. As shown in Fig. 18.4.6, data is held to
300mV where the leakage power is 1.92µW. At 325mV, fewer than
0.05% read/write errors are observed.

Figure 18.4.1: 6T cell SNM and bitline leakage (normalized to Iread)
demonstrating loss of functionality at low voltages.

Figure 18.4.2: 8T cell enabling low-voltage read/write and sensing.

Figure 18.4.3: Circuitry to eliminate sub-Vt leakage from unaccessed
read-buffers. Peripheral charge-pumps ensure buffer-foot drivers do not
limit Iread.

Figure 18.4.4: Cell write performed by weakening local feed-back. Cell
supply settles to low intermediate voltage determined by supply driver
and write drivers.

Figure 18.4.5: Without multiplexing, sense-amplifiers have stringent
offset and area requirements. With redundancy, errors depend on joint
probabilities, improving offset for a given area constraint.

Figure 18.4.6: Scope output and measurements of 65nm test-chip. Array
reads and writes at 350mV. Data is correctly retained at 300mV.

Figure 18.4.7: Die photograph of 256kb 8T SRAM in 65nm CMOS.

DESIGN OF AN ULTRA-LOW-VOLTAGE UWB BASEBAND PROCESSOR


Vivienne Sze, Anantha P. Chandrakasan
Massachusetts  Institute  of  Technology 


ABSTRACT 

This paper presents an energy-efficient UWB baseband processor
that achieves a 100-Mbps throughput while operating
at a sub-threshold supply voltage of 0.4 V. While sub-threshold
operation is traditionally used for low energy, low
performance applications (e.g., wrist-watches), this work examines
how it can be applied to low energy, high performance
applications using extreme parallelism. Measured results for
a 20-pJ/bit 0.4-V UWB baseband processor are presented.
Power  gating  is  used  to  reduce  leakage  energy. 

1.  BACKGROUND  INFORMATION 

This  work  was  performed  during  the  master's  degree  program 
from September 2004 to June 2006 at the Massachusetts Institute
of Technology in Cambridge, MA, United States. The
submission category is “Operational Chip Design”.

2. INTRODUCTION

The consumer electronics industry is exploring the use of
ultra-wideband (UWB) communications, a short-range high-data-rate
radio technology, to complement longer range radio technologies
such as Wi-Fi, WiMAX, and cellular wide area communications.
UWB communications can be used to send data
from a host device to other devices within the immediate area,
eliminating the need for wires and increasing mobility [1].
The use of UWB as a medium for high-data-rate last-meter
wireless links requires that UWB radios be integrated onto
battery-operated devices such as mobile phones, handheld devices
and sensor nodes. Consequently, there is a need for an
energy-efficient UWB transceiver. The main contribution of
this work is to demonstrate how extreme parallelism in the
digital baseband processor allows for

*  sub-threshold operation at 0.4 V to lower energy consumed
by the baseband processor

*  reduced  acquisition  time  to  lower  energy  consumed  by 
other  blocks  in  the  receiver 

This  is  the  first  work  to  demonstrate  the  use  of  sub-threshold 
operation  for  a  high  performance  application. 

This paper will begin with a description of the UWB specifications
and complete receiver architecture. Next, the main
functions of the baseband are discussed. This is followed by
a description of how parallelism can be used to achieve an
energy-efficient baseband processor and an explanation of the
design methodology used to implement it. Finally, the measured
results and the test setup used to obtain these results are
presented.


Fig. 1. Block diagram of UWB receiver

3.  UWB  SPECIFICATIONS  AND  RECEIVER 

The FCC has authorized UWB wireless communications in
the 3.1-GHz to 10.6-GHz band with a minimum bandwidth of
500 MHz and a maximum equivalent isotropic radiated power
spectral density of -41.3 dBm/MHz [2]. There are two technological
approaches for UWB communications: OFDM and
pulse-based. This work focuses on the latter using 2-ns binary
phase-shift keying (BPSK) pulses.

The receiver, shown in Figure 1, uses a direct-conversion
architecture in the front-end and the in-phase and quadrature
components are sampled at 500 MSPS by two 5-bit ADCs [3].
For real-time demodulation of the UWB packet, the digital
baseband must perform the signal processing with a throughput
of 500 MSPS. Synchronization is performed entirely in
the digital domain and only the automatic gain control (AGC)
is fed back to the analog domain. The baseband was implemented
using a standard digital logic cell library in the 90-nm
process.

The UWB packets are built from a sequence of BPSK
pulses with a 500-MHz bandwidth. The transmitter generates
approximate Gaussian pulses and up-converts the packet
to one of 14 channels in the 3.1-GHz to 10.6-GHz band. The
physical layer of each packet, shown in Figure 2, can be divided
into two sections: preamble and payload. The preamble
contains repetitions of a Nc = 31 bit Gold code (PN sequence)
sent at a pulse repetition frequency (PRF) of 25 MHz. The
payload contains the actual data and is sent at a PRF of 100
MHz for a 100-Mbps data rate with no channel coding.


Fig. 2. UWB physical-layer packet format


Fig. 3. Correlator Architecture


of the payload bits. Demodulation involves the use of a 5-fingered
RAKE receiver to collect and optimally combine the
signal energy received on the multiple echo paths using the
tap gains determined by the channel estimation. A hard decision
is made at the output of the maximum ratio combiner
(MRC) to resolve a bit.

The total energy spent on receiving the UWB signal can
be divided into two components: acquisition (preamble) energy
and demodulation (payload) energy. One of the goals
of this work is to reduce the energy spent by the receiver
on acquisition. Since this energy does not go directly towards
the demodulation of the data, it is seen as overhead energy.
During short bursty traffic, where the payload is small,
this overhead energy accounts for a significant portion of the
total packet energy. Therefore, it is desirable to minimize
the amount of overhead energy per packet. The majority of
this overhead energy goes into the computation of the
cross-correlation function. In this work, we take two different
approaches towards reducing this overhead energy, both of which
exploit the use of parallelism.

4.2. Sub-threshold Operation

There are a fixed number of operations required by the baseband
in order to compute the cross-correlation function, and
therefore in order to reduce the energy of the baseband, we
need to reduce its energy per operation. The first approach
involves scaling down the supply voltage (VDD) such that the
correlator, which computes the cross-correlation, operates at
its minimum energy point [4]. The minimum energy point
occurs since the total energy per operation is composed of
dynamic energy and leakage energy.


4. UWB BASEBAND PROCESSOR
4.1. UWB Baseband Operation

The baseband processor implements acquisition, synchronization
and demodulation by transitioning between two states of
operation. The preamble is used by the receiver to achieve
acquisition and synchronization. At the receiver, the baseband
processor computes the cross-correlation function (y[n]) between
the incoming noisy preamble (x[n]) and a clean template
of the 31-bit Gold code sequence (h[n]).

$$y[n] = \sum_{k=0}^{N_c-1} x[k] \, h[k-n]$$

The  computation  shown  above  is  performed  with  the  use  of  a 
correlator  (Figure  3). 
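As a rough software illustration of the correlation above (not the hardware correlator of Figure 3), the sketch below evaluates y[n] for every lag and picks the largest value. Several things here are assumptions made for the example: the template is a 31-chip sequence from a small LFSR standing in for the actual Gold code, which the text does not list; indices wrap modulo the code length; and the noise level, seed and delay are arbitrary.

```python
import random

NC = 31  # code length N_c, as in the text


def lfsr_sequence(length=NC):
    """Return a +/-1 sequence from a 5-bit LFSR (stand-in for the Gold code)."""
    state = [1, 0, 0, 0, 0]
    chips = []
    for _ in range(length):
        chips.append(1 if state[4] else -1)
        feedback = state[4] ^ state[1]
        state = [feedback] + state[:4]
    return chips


def cross_correlate(x, h):
    """y[n] = sum_k x[k] * h[k - n], with indices taken modulo len(h)."""
    n_c = len(h)
    return [
        sum(x[k] * h[(k - n) % n_c] for k in range(n_c))
        for n in range(n_c)
    ]


def detect_peak(y):
    """Return the lag with the largest correlation value."""
    return max(range(len(y)), key=lambda n: y[n])


template = lfsr_sequence()
true_delay = 7                # hypothetical code phase of the received preamble
rng = random.Random(0)        # fixed seed so the run is repeatable
received = [
    template[(k - true_delay) % NC] + rng.gauss(0.0, 0.5)
    for k in range(NC)
]
estimated_delay = detect_peak(cross_correlate(received, template))
print(estimated_delay)
```

Each call to `cross_correlate` computes all 31 lags one after another; the hardware described later computes many of these lags at the same time.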

Peak detection is performed on the cross-correlation to achieve
signal acquisition as well as synchronization. The cross-correlation
also provides the channel estimation. Following synchronization, the
baseband performs demodulation


$$E_{total} = E_{dynamic} + E_{leakage} = C_{eff} V_{DD}^{2} + I_{leak} V_{DD} T_{delay}$$

From the above equation we see that lowering VDD decreases the
dynamic energy. While reducing VDD reduces the leakage power, it also
increases the delay (Tdelay) of the gates. When VDD is above the
threshold voltage of the device, the delay increases approximately
linearly as VDD is reduced, and there is no significant change in the
leakage energy; however, when VDD drops below the threshold voltage
of the device, both delay and leakage energy increase exponentially.
Since the dynamic energy and leakage energy scale in opposite
directions as VDD decreases, a minimum energy point occurs in the
sub-threshold region. Spectre simulations performed on the correlator
indicate that the minimum energy point occurs at 0.3 V, which gives a
9X energy reduction as compared to the full-scale 1-V operation
(Figure 4). Ideally, it would be desirable to scale VDD such that the
baseband operates at this minimum energy point.
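The trade-off between dynamic and leakage energy can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Spectre simulation: it assumes a constant leakage current, a sub-threshold gate delay proportional to V·exp(-V/(n·Vt)), and a hypothetical lumped constant `leak_ratio` chosen only so that the curve bottoms out in the neighborhood of the reported 0.3 V. The model is not valid above threshold and does not reproduce the 9X figure.

```python
import math


def energy_per_op(vdd, c_eff=1.0, leak_ratio=800.0, n_vt=0.039):
    """Normalized E_total = E_dynamic + E_leakage for a sub-threshold gate.

    c_eff, leak_ratio and n_vt are illustrative values, not fitted data.
    """
    dynamic = c_eff * vdd ** 2
    # T_delay ~ C*V / I_on with I_on ~ exp(V / (n*Vt)), so
    # E_leakage = I_leak * V * T_delay ~ V^2 * exp(-V / (n*Vt)).
    leakage = c_eff * vdd ** 2 * leak_ratio * math.exp(-vdd / n_vt)
    return dynamic + leakage


def minimum_energy_point(v_lo=0.15, v_hi=0.60, step=0.005):
    """Grid-search the supply voltage that minimizes energy per operation."""
    count = int(round((v_hi - v_lo) / step)) + 1
    grid = [v_lo + i * step for i in range(count)]
    return min(grid, key=energy_per_op)


v_min = minimum_energy_point()
print(round(v_min, 3), energy_per_op(v_min))
```

Below the minimum the exponentially growing delay lets leakage dominate, and above it the quadratic dynamic term dominates, which is the behavior the paragraph describes.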

Fig. 4. Simulated energy plot for the correlator.

However, as previously mentioned, the baseband processing must
sustain a throughput of 500 MSPS in order to achieve real-time
demodulation. This can be achieved by a single correlator operating
at a frequency of 500 MHz with a much higher voltage than 0.3 V, but
we have shown that this is not energy efficient. Instead, it is
better to operate in sub-threshold at a reduced frequency, and
utilize parallelism (L) in the baseband to meet the throughput
constraint.

In order to refrain from introducing additional complexity due to
parallelism, it is preferable that the operating frequency be a
factor of the preamble PRF (25 MHz). The operating frequency is equal
to 25 MHz if the supply voltage is raised slightly to 0.4 V. Since
the minimum energy point is shallow, this slight change in VDD does
not cause a significant energy penalty. By operating at 0.4 V rather
than 1 V, the energy per operation is reduced by almost 6X. At
25 MHz, the correlators need to be parallelized by a factor of L=20
in order to maintain the 500-MSPS throughput.

This form of parallelism can also be used to reduce the energy spent
on the demodulation of the payload bits. The MRC of the RAKE receiver
is parallelized by a factor of 4 such that it can operate off the
same supply voltage and operating frequency as the rest of the
baseband. Combining parallelism with sub-threshold operation delivers
energy savings for receiving the entire UWB packet.
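The parallelism factors quoted above follow from dividing the required rate by the clock frequency. The helper below is a small sketch of that arithmetic. Reading the MRC factor of 4 as the 100-Mbps bit rate divided by the 25-MHz clock is an interpretation added here, not a statement from the text.

```python
def required_parallelism(rate_mega_per_s, clock_mhz):
    """Number of parallel units needed to sustain a rate at a given clock."""
    factor, remainder = divmod(rate_mega_per_s, clock_mhz)
    if remainder != 0:
        raise ValueError("clock frequency must divide the required rate evenly")
    return factor


L = required_parallelism(500, 25)           # correlators per sub-bank
mrc_factor = required_parallelism(100, 25)  # assumed: payload bits per clock
single_unit = required_parallelism(500, 500)
print(L, mrc_factor, single_unit)
```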

4.3. Reduce Acquisition Time

In addition to reducing the energy of the baseband, we would also
like to reduce the overhead energy spent by the rest of the blocks in
the receiver. The entire receiver must be turned on for the duration
of the entire UWB packet. The second approach involves reducing
acquisition time to minimize the overall on-time of the receiver,
which includes the RF front-end, two ADCs and two baseband
amplifiers. When combined with duty-cycling, this results in a
reduction of the overhead energy.

Fig. 5. Breakdown of overhead energy

Reduced acquisition time can be achieved by computing multiple points
in the cross-correlation function at the same time. This involves
replicating the correlator architecture and operating the copies in
parallel. As the degree of parallelism (M) increases, the maximum
time to achieve acquisition decreases in proportion to 1/M. The
number of Gold code repetitions in the preamble can subsequently be
reduced, which results in shorter packets and therefore shorter
receiver on-time. Further analysis is presented in [5].
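The 1/M scaling can be sketched with a simple search-time model. The model is an assumption made for illustration: each of the M parallel units tests one code phase per code period, all 31 phases must be covered, and a code period lasts Nc pulses at the 25-MHz preamble PRF. Any averaging over several repetitions to combat noise is ignored.

```python
import math

NC = 31                  # Gold code length from the text
PREAMBLE_PRF_HZ = 25e6   # preamble pulse repetition frequency from the text


def max_search_time_s(m, n_c=NC, prf_hz=PREAMBLE_PRF_HZ):
    """Worst-case time to test every code phase with m parallel units."""
    code_period_s = n_c / prf_hz       # duration of one code repetition
    passes = math.ceil(n_c / m)        # code periods needed to cover all phases
    return passes * code_period_s


for m in (1, 4, 16, 31):
    print(m, max_search_time_s(m))
```

With M equal to the code length, a single code period suffices in this model, which corresponds to the fully parallel case discussed next.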

The impact of this on-time reduction can be seen in the reduction of
the overhead energy consumed by the receiver shown in Figure 5. These
values were derived from the measured power of the RF front-end and
ADCs, which account for 19% of the total receiver power [3, 6]. When
the baseband is parallelized by the length of the Gold code
(M = Nc = 31), all points of the cross-correlation function can be
computed simultaneously, which minimizes the on-time of the receiver,
resulting in a 14X reduction in overhead energy. For a 4-kbit packet,
this degree of parallelism results in a 43% reduction in the total
energy per packet consumed by the receiver.


4.4. Baseband Architecture

The combination of these two approaches results in a highly
parallelized implementation with a total of L x M = 620 correlators
and 4 RAKE MRCs. The parallelized architecture is shown in Figure 6.
There are L=20 correlators in each sub-bank in order to maintain the
500-MSPS throughput, and M=31 sub-banks so that all points of the
cross-correlation can be computed at the same time. The first form of
parallelism, which reduces the energy of the baseband processor, is
determined by the frequency of the correlator near its minimum energy
point, while the second form, which reduces the energy of the other
blocks in the receiver, is dictated by the length of the Gold code
sequence (Nc).


Fig.  6.  Architecture  of  highly  parallelized  energy  efficient 
UWB  baseband  processor 

5.  DESIGN  METHODOLOGY 

5.1. Circuit Simulation and Implementation Tools

The baseband algorithm was first verified using MATLAB to ensure
correct functionality. This setup was also useful in generating test
vectors. Initially, only the correlator was synthesized by Synopsys
Design Compiler using STMicroelectronics' 90-nm standard cell
library. Cadence Spectre was then used to simulate the correlator to
determine its minimum energy point. The standard cell library was
re-characterized with Cadence SignalStorm for the optimum voltage
point of 0.4 V. In sub-threshold operation, the delay of the gates
decreases with temperature, which is contrary to the behavior in
active region operation. This is because Ioff increases with
temperature, while Ion decreases with temperature. The corner library
characterizations take this into account (i.e., the fast corner used
a higher temperature than the slow corner).

With the use of Perl scripting, the baseband algorithm was translated
into digital circuits written in Verilog with the appropriate degree
of parallelism (L, M). The entire baseband processor was then
synthesized with the 0.4-V library, and Synopsys Astro was used for
place-and-route. Distributed clock gating was incorporated for
further power savings on the clock network. For instance, this
ensures that the large correlator bank is not clocked during
demodulation.

Also, due to the high degree of parallelism, a hierarchical approach
was used to minimize the turn-around time of the EDA tools. Synthesis
was performed in the following order:

1. a single correlator

2. the correlator sub-bank (instantiate 20 correlators)

3. the correlator bank (instantiate 31 sub-banks)

4. top-level baseband processor (instantiate correlator bank)

Timing verification that incorporated global variations was performed
using Synopsys PrimeTime. In sub-threshold operation, the impact of
local transistor-to-transistor variations is quite severe.
Consequently, for additional variation analysis, Monte Carlo
simulations were performed using Spectre to verify timing on critical
paths. The circuit was also simulated in Nanosim with extracted RC
parasitics.

5.2.  I/O  Implementation  Considerations 

It was desirable to minimize the size of the die since this allows
for a smaller package with lower bond wire inductance and cost. Since
the baseband processor was pad limited, steps were taken to reduce
the number of I/O pads. The control bits were read in serially
through a shift register, which helped reduce 85 signals to 4. In
addition, rather than having 200 pads [= 5 (number of bits in ADC) x
2 (for in-phase and quadrature components) x L (degree of parallelism
for throughput)] allocated to the input signal, the baseband only
took in 5 parallel complex inputs at a time rather than 20, which
reduced the input signal pads to 50. The penalty for this was that a
serial-to-parallel converter was required internally on the baseband
processor, resulting in a second clock domain of 100 MHz. The timing
constraints for signals crossing the clock domains have to be
carefully set and verified. Finally, test points at various stages of
the baseband were passed through a mux such that they used a minimum
number of pads without compromising the testability of the chip.
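The pad-count trade-off above is simple arithmetic, sketched below. The constants come from the text; the function names are illustrative. The input clock is taken as the 500-MS/s sample rate divided by the number of complex samples brought in per clock.

```python
ADC_BITS = 5             # bits per ADC sample
IQ_COMPONENTS = 2        # in-phase and quadrature
SAMPLE_RATE_MSPS = 500   # required input throughput


def input_pads(parallel_inputs):
    """Pads needed to bring in this many complex samples per clock."""
    return ADC_BITS * IQ_COMPONENTS * parallel_inputs


def input_clock_mhz(parallel_inputs):
    """Clock rate of the input interface for a given input width."""
    return SAMPLE_RATE_MSPS / parallel_inputs


print(input_pads(20), input_clock_mhz(20))  # fully parallel input
print(input_pads(5), input_clock_mhz(5))    # narrower input, faster clock
```

Narrowing the input from 20 to 5 complex samples cuts the pads by a factor of four and raises the interface clock by the same factor, which is why the second clock domain appears.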

The completed chip had 152 I/O pads. A 144-pin CQFP package was used,
which required 8 ground pads to be down-bonded to the package paddle,
and the paddle was bonded to 21 package ground pins.

Since 50 input signals enter the chip at 100 MHz, a surface mount
socket was used to minimize inductance. A 4-layer PCB was used in
order to have a solid ground plane and to minimize routing of the
100-MHz signals. The board layout is shown in Figure 7.

5.3.  Test  Equipment  and  Setup 

A Keithley sourcemeter, a Tektronix 500-MHz real-time scope, an
arbitrary waveform generator, a logic analyzer and a pattern
generator were used for testing and chip measurement. The minimum
output voltage of the pattern generator was 1.2 V, which overdrove
the input pads of the chip. The I/O pads operated off a 1-V supply,
while the core operated off a separate 0.4-V supply.

6. PERFORMANCE RESULTS

The baseband processor, shown in Figure 7, demonstrates 100-Mbps
operation in the sub-threshold region at 0.4 V with an operating
frequency of 25 MHz. A summary of the performance metrics is shown in
Table 1. As previously mentioned, we are pad-limited and
consequently, only 23% of the die area is active; the rest is filled
with decoupling MOS capacitors. The active area of the baseband is
comparable to the total active area of the RF front-end and ADC
[3, 6].

Fig. 7. PCB and die photo of baseband processor

The breakdown of the energy per bit consumed by the baseband is shown
in Figure 8. For a 4-kbit packet, the average energy per bit consumed
by the baseband processor is 20 pJ, with 3 pJ going towards
acquisition and 17 pJ going to demodulation.


Chip Specifications

Process Technology       90 nm
Die Size                 3.3 mm x 3.3 mm
Bit Rate                 100 Mbps
Transistor Count         2.6M (2.8k per correlator)
Operating Frequency      25 MHz
Supply Voltage           0.4 V
Power Consumption
  Acquisition            7 mW
  Demodulation           1.7 mW

Table 1. Chip Measurements
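The per-bit figures can be cross-checked against Table 1 with a short calculation. This sketch makes two assumptions: "4-kbit" is taken as 4000 payload bits, and the acquisition energy per packet (12 nJ) is inferred from the quoted 3 pJ per bit rather than measured separately. The second function shows how the fixed acquisition cost weighs more heavily on short packets.

```python
def energy_per_bit_pj(power_mw, bit_rate_mbps):
    """mW / Mbps equals nJ per bit; scale by 1000 to get pJ per bit."""
    return power_mw / bit_rate_mbps * 1000.0


def average_energy_per_bit_pj(payload_bits, acquisition_nj, demod_pj_per_bit):
    """Spread a fixed acquisition energy over the payload bits."""
    return acquisition_nj * 1000.0 / payload_bits + demod_pj_per_bit


demod_pj = energy_per_bit_pj(1.7, 100)   # demodulation power and bit rate
ACQ_ENERGY_NJ = 12.0                     # inferred: 3 pJ/bit x 4000 bits

avg_4k = average_energy_per_bit_pj(4000, ACQ_ENERGY_NJ, demod_pj)
avg_400 = average_energy_per_bit_pj(400, ACQ_ENERGY_NJ, demod_pj)
print(demod_pj, avg_4k, avg_400)
```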

By operating in the sub-threshold region, the baseband processor
achieves significant energy savings as compared with current
state-of-the-art UWB baseband transceivers. Under similar packet
length conditions, this baseband processor has a reported energy per
pulse of less than 1/600 of [7] and 1/5 of [8] (Table 2).


Fig. 8. Breakdown of energy per bit consumed by baseband


                     [7]         [8]          This Work
Process Technology   0.18 µm     0.18 µm      90 nm
Supply Voltage       1.8 V       1.2 V        0.4 V
Data Rate            193 kbps    62.5 Mbps    100 Mbps
Energy Per Pulse     12.5 nJ     107 pJ       20 pJ

Table 2. Comparison with the state-of-the-art

6.1. Power Gating

Both forms of parallelism assume that the receiver can be powered
off. Off-chip power gating was used to demonstrate this with the
baseband processor (Figure 9). Power gating involves gating the
leakage current when the system is idle. A Fairchild NFET was used as
the gating transistor.

Realistically, power gating itself costs energy; specifically, the
energy required to switch the gating transistor and the recovery
energy required to bring the virtual VDD back up to 0.4 V. There is a
minimum amount of time that the system must be powered off in order
for power gating to be advantageous. This time, known as the
break-even time, is the idle time at which the savings in leakage
energy equal the cost of power gating. Given that the leakage power
of the baseband is 745 µW, the break-even time was determined to be
137 µs. A shut-off signal for power gating is automatically generated
by the baseband when the packet is completely demodulated. The
turn-on signal could be generated at a higher level (e.g., MAC
layer).

Off-chip digital gates were used to implement the power gating
control logic (Figure 7). The off-chip gating transistor has a 3-V
switching voltage, which required that the control logic operate at
3 V, and a level converter be used to interface the control logic
with the 1-V shut-off signal from the baseband processor. A separate
3-V supply voltage was used to power this off-chip control logic.

The instantaneous power of the various states of operation is shown
in Figure 10. To obtain this measurement, a 10-Ω resistor was
inserted between the sourcemeter and node VDD, labeled in Figure 9,
to measure the current. The sourcemeter operated in the 4-wire sense
mode in order to maintain a 0.4-V drop across the gating transistor
and the baseband processor.
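The break-even condition for power gating can be written as leakage power times idle time equal to the fixed cost of gating. The sketch below uses the two figures quoted in this section (745 µW and 137 µs). The implied gating cost of roughly 102 nJ is derived from them and is not a number reported in the text. The model also assumes that leakage is removed entirely while the supply is gated.

```python
LEAKAGE_POWER_W = 745e-6      # baseband leakage power from the text
BREAK_EVEN_TIME_S = 137e-6    # reported break-even time


def implied_gating_cost_j(leak_w=LEAKAGE_POWER_W, t_be_s=BREAK_EVEN_TIME_S):
    """Energy cost of one off/on cycle implied by the break-even time."""
    return leak_w * t_be_s


def net_saving_j(idle_time_s, leak_w=LEAKAGE_POWER_W, gating_cost_j=None):
    """Leakage energy saved during the idle period minus the gating cost."""
    if gating_cost_j is None:
        gating_cost_j = implied_gating_cost_j(leak_w)
    return leak_w * idle_time_s - gating_cost_j


print(implied_gating_cost_j())   # joules per power-gating cycle
print(net_saving_j(50e-6))       # shorter than break-even: a net loss
print(net_saving_j(1e-3))        # longer than break-even: a net gain
```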
7. CONCLUSIONS

Extreme parallelism can be exploited to reduce acquisition time in
order to minimize receiver energy, and to enable the use of
sub-threshold operation for high performance applications. The
analysis in this paper can be mapped to other high performance
communication applications using sub-threshold operation and
parallelism.

Abstract: An analog-to-digital converter (ADC) and a digital baseband
processor for an ultra-wideband (UWB) radio receiver perform sampling
and demodulation of 100-Mbps UWB pulses. Parallelism is used to
achieve the high throughput with state-of-the-art power consumption.
The 5-bit 500-MS/s ADC consumes only 6 mW, and the digital processor
operates at 0.4 V.

Keywords: Parallelism; low-power; ultra-wideband radio;
analog-to-digital converter; digital processor

Introduction 

Ultra-wideband (UWB) radio is an emerging technology that shows
promise for very-high-data-rate wireless communication over short
distances. Applications of UWB include battery-operated devices such
as mobile phones, handheld devices and sensor nodes. Consequently,
there is a strong demand for an energy efficient UWB system.

The FCC defines UWB signals as having 10-dB bandwidths greater than
500 MHz, and limits transmission power density to less than
-41.3 dBm/MHz in the 3.1-10.6-GHz band [1]. The primary technical
approaches for high-data-rate UWB communication are OFDM [2] and
pulse-based [3] solutions. The baseband presented in this work
targets a custom pulse-based radio. BPSK-modulated Gaussian pulses
are transmitted at a pulse repetition frequency (PRF) of 100 MHz in
one of 14 500-MHz-wide channels within the UWB band [4]. After
down-conversion by a direct-conversion RF front-end [5] (Figure 1),
the received complex signal occupies 0-250 MHz.

Parallelism is exploited in both the analog-to-digital converter
(ADC) and the baseband processor in order to achieve the 100-Mbps
throughput with minimum power consumption. In the ADC,
time-interleaving allows the use of the energy efficient successive
approximation register (SAR) architecture [6], while in the baseband
processor, it enables operation using an ultra-low-voltage supply
[7].

500-MS/s, 5-bit Analog-to-Digital Converter

The 250-MHz down-converted pulses require a 500-MS/s Nyquist
converter, but the required resolution is limited to 4-5 bits [8].
Flash ADCs are the typical choice for this high-speed, low-resolution
regime. A flash converter compares the input, in parallel, to every
possible threshold voltage and determines the binary output in a
single clock cycle. This use of voltage parallelism enables the
highest-speed ADC operation, but it requires an exponential growth in
the number of comparators with the resolution. This undesirable
complexity characteristic has long motivated the choice of other
architectures. The successive approximation register (SAR) topology
has only a linear growth in the number of comparisons with the
resolution; however, it computes each bit of the digital output
sequentially and therefore requires multiple clock periods to resolve
a conversion, limiting conversion speed. Time-interleaving [9] uses
parallel channels, sampling at fixed time intervals, to increase the
conversion time available to any single channel, permitting use of
the energy efficient SAR architecture for this high-speed application
[10]. An energy comparison between the flash and time-interleaved SAR
architectures is presented in [11].

One limitation to the general use of parallelism is the requirement
of independent processing from sample to sample. Successive samples
of any true Nyquist converter, however, should be assumed to be
completely independent of each other. Thus, processing the samples in
parallel should give identical results to processing them serially.
In practice, however, mismatches between channels can negatively
impact ADC performance. The three primary mismatch concerns are
offset, gain, and timing skew. The design of the time-interleaved SAR
ADC is presented below, specifically addressing our solutions to
mismatches.

Top-level Architecture: The SAR algorithm requires one period to
decide each of the five output bits plus one period for sampling, six
periods in total. With six time-interleaved channels, the internal
channel clock period matches the overall sampling clock. Thus, only
one clock needs to be generated and distributed. Besides easing clock
distribution requirements, this also minimizes timing skew between
channels. A balanced layout for this single sampling clock is
sufficient to reduce errors arising from timing skew to below the
5-bit level. The top-level block diagram of the 6-channel ADC is
shown in Figure 2. Synchronization is performed by passing a start
token that signals when a channel should begin sampling. This keeps
the overhead associated with time-interleaving to a minimum.
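The successive approximation search and the round-robin channel assignment can be sketched in a few lines. This is an idealized behavioral model only: it assumes a unipolar input between 0 and a hypothetical `v_ref`, a perfect comparator and DAC, and no mismatch between channels. It does not model the split capacitor array or the start-token circuit.

```python
BITS = 5
PERIODS_PER_CONVERSION = BITS + 1   # one period per bit plus one for sampling
CHANNELS = PERIODS_PER_CONVERSION   # enough channels to accept a sample each period


def sar_convert(v_in, v_ref=1.0, bits=BITS):
    """Binary search for the output code, one comparison per bit (MSB first)."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        dac_output = trial * v_ref / (1 << bits)
        if v_in >= dac_output:      # comparator decision keeps or clears the bit
            code = trial
    return code


def interleave(samples, channels=CHANNELS):
    """Assign sample i to channel i mod channels and convert it there."""
    return [(i % channels, sar_convert(v)) for i, v in enumerate(samples)]


print(sar_convert(0.5))
print(interleave([0.0, 0.26, 0.5, 0.74, 0.99, 0.1, 0.3]))
```

Each channel runs at one sixth of the aggregate rate, so a 500-MS/s converter needs each SAR channel to complete a conversion only once every six sampling periods.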

Channel  Circuits:  Each  channel  is  composed  of  a 
capacitive  DAC,  a  comparator,  and  digital  control  logic, 
often  referred  to  as  the  SAR  itself.  The  DAC  is  the  split 
capacitor  array  [12],  which  features  decreased  switching 
energy  and  faster  switching  speed  than  the  conventional 
binary  weighted  capacitor  array.  Gain  mismatch  between 
channels  is  limited  by  capacitor  matching,  and  the  unit 
capacitor  size  is  thereby  chosen  conservatively. 


Figure  1.  UWB  direct  conversion  receiver  block  diagram.  Baseband  is  highlighted. 


Figure 2. Block diagram of 6-way time-interleaved SAR ADC.


Figure  3.  Block  diagram  of  the  SAR  channel. 

The comparator uses a two-stage auto-zeroed preamplifier and a
regenerative latch. The preamplifiers reduce the large offset voltage
of the latch to below one quarter of the LSB voltage when referred to
the input of the entire comparator chain, sufficient to limit offset
mismatch. All of the transistors in the comparator have longer than
the minimum channel length in order to improve matching and output
impedance.

Implementation and Measured Results: The ADC has been fabricated in a
65-nm CMOS process. A photograph of the 1.9 x 1.4 mm die is shown in
Figure 4. The input and clock paths use a fully balanced layout in
the middle of the die.

The effect of mismatch between channels can be seen in the FFT in
Figure 5. Distortion from the measured 0.3-V_LSB offset variation
appears as spurs at multiples of the channel sampling frequency.
Further spurs arise from timing and gain errors, dominated by the
former in this implementation. The measured gain error is 0.9%. The
ADC achieves full Nyquist operation, with the effective number of
bits dropping from 4J at DC to 4 at Nyquist. The measured 6-mW power
consumption is split roughly evenly between the analog and digital
supplies. The ADC performance summary is listed in Table 1.

Figure 4. Die photograph of ADC.


Table 1. Summary of ADC Performance

Technology                  65-nm CMOS 1P6M
Supply Voltage              1.2 V
Sampling Rate/Resolution    500 MHz/5 bit
SNDR (f_in = 239 MHz)       26.1 dB
DNL/INL                     0.16/0.26 LSBs
Power (analog/digital)      2.86/3.06 mW
Active Area                 0.65 mm x 1.4 mm
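The SNDR in Table 1 can be related to the effective number of bits with the standard conversion ENOB = (SNDR - 1.76 dB) / 6.02 dB. The sketch below applies it to the 26.1-dB Nyquist figure, and also lists where offset-mismatch spurs of an ideal 6-way interleaved converter would fall (multiples of the per-channel rate). The spur list is generic interleaving theory and is not read off Figure 5.

```python
SAMPLE_RATE_MHZ = 500.0
CHANNELS = 6


def enob(sndr_db):
    """Effective number of bits from SNDR, using the ideal-quantizer relation."""
    return (sndr_db - 1.76) / 6.02


def offset_spur_frequencies_mhz(fs_mhz=SAMPLE_RATE_MHZ, channels=CHANNELS):
    """Offset mismatch appears at multiples of fs/channels up to Nyquist."""
    channel_rate = fs_mhz / channels
    return [k * channel_rate for k in range(1, channels // 2 + 1)]


enob_nyquist = enob(26.1)
print(round(enob_nyquist, 2))
print([round(f, 1) for f in offset_spur_frequencies_mhz()])
```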


Figure  5.  FFT  of  239  MHz  input  with  dominant  spurs 
labeled. 


UWB  Baseband  Processor 

The digital baseband processor, shown in Figure 6, receives a
500-MS/s signal from the ADC and performs acquisition and
demodulation. The packet structure of the received signal is shown in
Figure 7. The preamble contains repetitions of a 31-bit PN sequence
with a PRF of 25 MHz, while the payload contains the actual data that
is sent at a PRF of 100 MHz for a 100-Mbps data rate with no channel
coding. During acquisition, a correlator is used to compute the
cross-correlation function between the incoming noisy preamble and a
clean template of the 31-bit PN sequence. Peak detection is performed
on the cross-correlation to achieve signal acquisition. Demodulation
is then performed on the payload with a 5-fingered RAKE receiver, and
a hard decision is made at the output of the maximum ratio combiner
(MRC) to resolve a bit.


Figure 6. Block diagram of parallelized digital baseband processor

Figure 7. UWB physical-layer packet format


The  energy  of  the  baseband  processor  is  reduced  by 
aggressively  scaling  down  its  supply  voltage  (VDD)  such 
that  the  correlator  operates  near  its  minimum  energy  point 
[13].  The  minimum  energy  point  occurs  because  the  total 
energy  per  operation  is  composed  of  dynamic  energy  and 
leakage  energy. 


$$E_{total} = E_{dynamic} + E_{leakage} = C_{eff} V_{DD}^{2} + I_{leak} V_{DD} T_{delay}$$


From the above equation, we see that lowering VDD decreases the
dynamic energy. While reducing VDD reduces the leakage power, it also
increases the delay (Tdelay) of the gates. When VDD is above the
threshold voltage of the device, the delay increases approximately
linearly as VDD is reduced, and there is no significant change in the
leakage energy; however, when VDD drops below the threshold voltage
of the device, both delay and leakage energy increase exponentially.
Since the dynamic energy and leakage energy scale in opposite
directions as VDD decreases, a minimum energy point occurs in the
sub-threshold region.

Simulations performed on the correlator, designed in a standard-VT
90-nm CMOS process, indicate that the minimum energy point occurs at
0.3 V, which gives a 9x energy reduction as compared to the
full-scale 1-V operation (Figure 8) [7]. Ideally, it would be
desirable to scale VDD such that the baseband operates at this
minimum energy point.

For  real-time  acquisition  and  demodulation  of  a  UWB 
packet,  the  baseband  processor  must  perform  signal 
processing  with  a  throughput  of  500  MS/s.  This  can  be 
achieved  by  a  single  correlator  operating  at  a  frequency  of 
500  MHz  with  a  much  higher  voltage  than  0.3  V,  but  we 
have  shown  that  this  is  not  energy  efficient.  Instead,  it  is 
better  to  operate  at  a  lower  voltage  with  a  reduced 
frequency,  and  utilize  parallelism  in  the  baseband  processor 
to  meet  the  throughput  constraint. 

In order to refrain from introducing additional complexity due to
parallelism, it is preferable that the operating frequency be a
factor of the preamble's PRF (25 MHz) such that an integer number of
pulses are processed per clock cycle. The operating frequency is
equal to 25 MHz if the supply voltage is raised slightly to 0.4 V.
Since the minimum energy point is shallow, this slight change in VDD
does not cause a significant energy penalty. By operating at 0.4 V
rather than 1 V, the energy per operation is reduced by almost 6x. At
25 MHz, the correlators need to be parallelized by a factor of 20 in
order to maintain the 500-MS/s throughput.

In addition, the MRC of the RAKE receiver is parallelized by a factor
of 4 such that it can operate off the same supply voltage and
operating frequency as the rest of the baseband processor. This also
reduces the energy required for demodulation. Combining parallelism
with ultra-low-voltage operation delivers energy savings for
receiving the entire UWB packet.

Conclusion 

An energy-efficient baseband for a UWB radio has been presented.
Parallelism has enabled the very low power consumption in this high
performance application. The low complexity but high latency
successive approximation register ADC architecture is combined with
time-interleaving to achieve the desired throughput and performance
in deep-submicron CMOS. The highly parallelized packet acquisition
leads to a significant reduction in operating voltage. The baseband
presented here can be integrated in a single-chip solution in a
highly energy-efficient manner by increasing the number of
time-interleaved ADC channels. With 20 channels, each channel would
directly feed one set of correlator banks, and the channels
themselves could translate the increased conversion time into reduced
operating voltages for further energy savings.

Acknowledgments 

This work is funded by DARPA, an NDSEG Fellowship, and an NSERC
Fellowship. The authors would like to thank Texas Instruments and
STMicroelectronics for fabrication services.

References 

[1] Federal Communications Commission, "Ultra-wideband first report
and order," FCC 02-48, Feb. 2002.

[2] A. Batra, et al., "Multi-band OFDM physical layer proposal for
IEEE 802.15 Task Group 3a," IEEE P802.15-04/0493r0, Sept. 2004.

[3] R. Fisher, et al., "DS-UWB physical layer submission to 802.15
Task Group 3a," IEEE P802.15-04/0137r3, July 2004.

[4] D. D. Wentzloff, et al., "System design considerations for
ultra-wideband communication," IEEE Commun. Mag., vol. 43, no. 8,
pp. 114-121, Aug. 2005.

[5] F. S. Lee and A. P. Chandrakasan, "A BiCMOS ultra-wideband
3.1-10.6-GHz front-end," IEEE J. Solid-State Circuits, vol. 41,
no. 8, pp. 1784-1791, Aug. 2006.

[6] B. P. Ginsburg and A. P. Chandrakasan, "A 500MS/s 5b ADC in 65nm
CMOS," in Symp. on VLSI Circuits Dig. of Tech. Papers, June 2006,
pp. 174-175.

[7] V. Sze, et al., "An energy efficient sub-threshold baseband
processor architecture for pulsed ultra-wideband communications,"
IEEE Int. Conf. on Acoustics, Speech and Signal Processing, May 2006,
pp. (III) 908-911.

[8] P. P. Newaskar, R. Blazquez, and A. P. Chandrakasan, "A/D
precision requirements for an ultra-wideband radio receiver," in IEEE
Workshop on Signal Processing Systems, Oct. 2002, pp. 270-275.

[9] W. Black and D. Hodges, "Time interleaved converter arrays," IEEE
J. Solid-State Circuits, vol. 15, no. 6, pp. 929-938, Dec. 1980.

[10] D. Draxelmayr, "A 6b 600MHz 10mW ADC array in digital 90nm
CMOS," in ISSCC Dig. Tech. Papers, Feb. 2004, pp. 264-265.

[11] B. P. Ginsburg and A. P. Chandrakasan, "Dual time-interleaved
successive approximation register ADCs for an ultra-wideband
receiver," IEEE J. Solid-State Circuits, vol. 42, Feb. 2007, to be
published.

[12] B. P. Ginsburg and A. P. Chandrakasan, "An energy-efficient
charge recycling approach for a SAR converter with capacitive DAC,"
in Proc. of the IEEE Int. Symp. on Circuits and Systems, vol. 1,
May 2005, pp. 184-187.

[13] B. Calhoun, A. Wang and A. P. Chandrakasan, "Modeling and sizing
for minimum energy operation in sub-threshold circuits," in IEEE J.
Solid-State Circuits, vol. 40, no. 9, pp. 1778-1786, September 2005.


Journal  Papers 


IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 7, JULY 2006


Static  Noise  Margin  Variation  for  Sub-threshold 
SRAM  in  65-nm  CMOS 

Benton H. Calhoun, Member, IEEE, and Anantha P. Chandrakasan, Fellow, IEEE


Abstract - The increased importance of lowering power in
memory design has produced a trend of operating memories at
lower supply voltages. Recent explorations into sub-threshold
operation for logic show that minimum energy operation is
possible in this region. These two trends suggest a meeting
point for energy-constrained applications in which SRAM operates
at sub-threshold voltages compatible with the logic. Since
sub-threshold voltages leave less room for large static noise margin
(SNM), a thorough understanding of the impact of various design
decisions and other parameters becomes critical. This paper
analyzes SNM for sub-threshold bitcells in a 65-nm process for
its dependency on sizing, VDD, temperature, and local and global
threshold variation. The VT variation has the greatest impact on
SNM, so we provide a model that allows estimation of the SNM
along the worst-case tail of the distribution.

Index  Terms — Sub-threshold,  sub-threshold  memory,  SRAM, 
static  noise  margin,  process  variation,  voltage  scaling. 


I. Introduction

SUB-THRESHOLD digital circuit design has emerged as
a low energy solution for applications with strict energy
constraints. Analysis of sub-threshold designs has focused on
logic circuits (e.g., [1]). SRAMs comprise a significant percentage
of the total area for many digital chips as well as the total
power [2], [3]. For this reason, SRAM leakage can dominate
the total leakage of the chip, and large switched capacitances in
the bitlines and wordlines make SRAM accesses costly in terms
of energy. Pushing SRAM operation into the sub-threshold region
reduces both leakage power and access energy. Also, for
system integration, SRAM must become capable of operating
at sub-threshold voltages that are compatible with sub-threshold
combinational logic. Recent low-power memories show a trend
of lower voltages with some designs holding state on the edge
of the sub-threshold region (e.g., [4]). This scaling promises
to continue, leading to sub-threshold storage modes and even
sub-threshold operation for SRAMs operating in tandem with
sub-threshold logic.

When the bitcell is holding data, its wordline is low so the
nMOS access transistors are off. In order to hold its data properly,
the back-to-back inverters must maintain bi-stable operating
points. The best measure of the ability of these inverters
to maintain their state is the bitcell's static noise margin (SNM)
[5]. The SNM is the maximum amount of voltage noise that can
be introduced at the outputs of the two inverters such that the
cell retains its data. SNM quantifies the amount of voltage noise
required at the internal nodes of a bitcell to flip the cell's contents.

Fig. 1 shows a conceptual setup for modeling SNM [5]. Noise
sources having value VN are introduced at each of the internal
nodes in the bitcell. As VN increases, the stability of the cell
changes. Fig. 2 shows the most common way of representing the
SNM graphically for a bitcell holding data. The figure plots the
voltage transfer characteristic (VTC) of Inverter 2 from Fig. 1
and the inverse VTC from Inverter 1. The resulting two-lobed
curve is called a “butterfly curve” and is used to determine the
SNM. The SNM is defined as the length of the side of the largest
square that can be embedded inside the lobes of the butterfly
curve [5]. To understand why this definition holds, consider the
case when the value of VN increases from 0. On the plot, this
causes the VTC^-1 for Inverter 1 in the figure to move downward
and the VTC for Inverter 2 to move to the right. Once they
both move by the SNM value, the curves meet at only two points.
Any further noise flips the cell.

Although the SNM is certainly important during hold, cell
stability during active operation represents a more significant
limitation to SRAM operation. Specifically, at the onset of a read
access, the wordline is “1” and the bitlines are still precharged
to “1” as Fig. 3 illustrates. The internal node of the bitcell that
represents a zero gets pulled upward through the access transistor
due to the voltage dividing effect across the access transistor
(M2, M5) and drive transistor (M1, M4). This increase
in voltage severely degrades the SNM during the read operation
(read SNM). Fig. 4 shows example butterfly curves during hold
and read that illustrate the degradation in SNM during read.

Fig. 2. The length of the side of the largest embedded square in the butterfly curve is the SNM. When both curves move by more than this amount (e.g., VN =
SNM), then the bitcell is monostable, losing its data.


Fig. 3. Schematic of the 6T bitcell at the onset of a read access. WL has just
gone high, and both BLs are precharged to VDD. The voltage dividing effect
across M4 and M5 pulls up node QB, which should be 0 V, and degrades the
SNM.


Fig.  4. 


the tail of the probability density function (PDF) that dominates
SNM failures [6].

The minimum voltage for retaining bistability was theorized
in [7] and modeled for SRAM in [8], but degraded SNM can
limit voltage scaling for SRAM designs above this minimum
voltage. SNM quantifies the amount of voltage noise required
at the internal nodes of a bitcell to flip the cell's contents.

An expression for above-threshold SNM based on
long-channel models is given in [5], and [9] models
above-threshold SNM for modern processes with process
variation. This section builds on previous work by examining
SNM for sub-threshold SRAM [6].

A.  Modeling  Sub-Threshold  Static  Noise  Margin 

Lowering VDD reduces gate current much more rapidly than
sub-threshold current, so total current in the sub-threshold region
can be modeled to first order as

$$I_{sub\text{-}threshold} = I_S \exp\left(\frac{V_{GS}-V_T}{nV_{th}}\right)\left(1-\exp\left(\frac{-V_{DS}}{V_{th}}\right)\right). \qquad (1)$$

The sub-threshold factor n = 1 + C_d/C_ox, Vth = kT/q,
and IS is the current when VGS equals VT. For simplicity, we
treat pMOS parameters as positive values. For the 65-nm technology
used in this section, the nMOS drive current is higher
in above-threshold than the pMOS for iso-width, but the pMOS
current is higher in sub-threshold due to its lower VT. During
hold mode, the wordline is low so M2 and M5 have VGS < 0 and
thus negligible current. We can model the cell VTCs (VOUT =
f_VTC(VIN)) as those of a simple inverter in sub-threshold.
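
As a rough illustration of the first-order current model, the sketch below evaluates the standard sub-threshold form I = IS exp((VGS - VT)/(n Vth)) (1 - exp(-VDS/Vth)) in Python. The parameter values (`i_s`, `v_t`, `n`) are placeholders chosen for illustration only; they are assumptions, not data for the 65-nm process discussed here.

```python
import math

K_B_OVER_Q = 8.617333262e-5  # Boltzmann constant divided by electron charge, in V/K


def thermal_voltage(temp_c=25.0):
    """Return Vth = kT/q in volts for a temperature in degrees Celsius."""
    return K_B_OVER_Q * (temp_c + 273.15)


def subthreshold_current(v_gs, v_ds, i_s=1e-7, v_t=0.4, n=1.5, temp_c=25.0):
    """First-order sub-threshold drain current.

    i_s, v_t and n are illustrative placeholders (assumptions),
    not parameters of any particular process.
    """
    v_th = thermal_voltage(temp_c)
    gate_term = math.exp((v_gs - v_t) / (n * v_th))
    drain_term = 1.0 - math.exp(-v_ds / v_th)
    return i_s * gate_term * drain_term


def subthreshold_slope(n=1.5, temp_c=25.0):
    """S = n * Vth * ln(10): gate voltage needed for a 10x change in current."""
    return n * thermal_voltage(temp_c) * math.log(10.0)
```

With these definitions, raising VGS by one sub-threshold slope S multiplies the modeled current by ten, and the drain term saturates once VDS exceeds a few Vth.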


Fig. 4. Example butterfly curve plots for SNM during hold and read.


II. Static Noise Margin

This section evaluates the SNM of six-transistor (6T) SRAM
bitcells operating in sub-threshold. We analyze the dependence
of SNM during both hold and read modes on supply voltage,
temperature, transistor sizes, local transistor mismatch due to
random doping variation, and global process variation in a commercial
65-nm technology. We analyze the statistical distribution
of SNM with process variation and provide a model for

Referring to Fig. 2, (2) [7] gives the inverse VTC for inverter
1 (VIN = f_VTC(VOUT)). The inverse of (2) is given in [10] for

Fig. 5. First-order VTC equations versus simulation. Line A is (2), line B is
(3), line C is a piecewise combination of (5) and (2), and line D is a piecewise
combination of (3) and the graphical inverse of (5).


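matched pMOS and nMOS (same n, VT, IS). We give a full
solution for VOUT = f_VTC(VIN) for inverter 2 in (3):

$$V_{QB} = V_{DD} + V_{th}\ln\left(\frac{1 - G + \sqrt{(G-1)^2 + 4G\exp\left(-V_{DD}/V_{th}\right)}}{2}\right) \qquad (3)$$

$$G = \exp\left(\frac{n_4+n_6}{n_4 n_6 V_{th}}V_Q - \ln\frac{I_{S6}}{I_{S4}} - \frac{V_{DD}}{n_6 V_{th}} - \frac{1}{V_{th}}\left(\frac{V_{T4}}{n_4} - \frac{V_{T6}}{n_6}\right)\right). \qquad (4)$$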


Fig. 5(a) plots (2) and (3) against simulation curves for no
local mismatch and for 1σ VT mismatch in M1.

During a read access, the wordline goes high and the bitlines
are precharged to VDD so, if VQ = 0 prior to access, M1 and
M2 are both on. This creates a voltage division that raises the
voltage at Q. Assuming pMOS current is negligible in the region
of interest, (5) shows the inverse VTC equation during a read
operation near the SNM [4] for inverter 1:

$$V_{QB} = n_1 V_{th}\ln\frac{I_{S2}}{I_{S1}} + n_1 V_{th}\ln\left(\frac{1-\exp\left(-\frac{V_{DD}-V_Q}{V_{th}}\right)}{1-\exp\left(-\frac{V_Q}{V_{th}}\right)}\right) + V_{T1} + \frac{n_1}{n_2}\left(V_{DD} - V_{T2} - V_Q\right). \qquad (5)$$

This equation cannot be inverted analytically, and it applies
only to the region of the VTC where VOUT is low. Fig. 5(b)
shows (5) and its graphical inverse combined piecewise with (2)
and (3) and plotted against simulation for no local mismatch and
for 1σ VT mismatch in M1 for minimum device sizes at 25 °C.

Fig. 6. Changes in sub-threshold slope (S) versus (a) VGS and (b) temperature.

with varying VDD.

Graphical or numerical solutions for SNM are easily derived
from the VTC equations, although no direct analytical solution
exists. The equations provide a good estimate of the behavior of
the SNM based on key parameters. One shortcoming of (2)-(5)
is the assumption that sub-threshold slope (S = nVth ln 10) is
constant for each transistor. Fig. 6(a) shows that S varies with
VGS, and Fig. 6(b) shows S changing with temperature without
the expected constant slope due to Vth. A more crucial problem
with (2)-(5) is the assumption that certain currents are negligible.
These assumptions break down under certain combinations
of VT variation, rendering the first-order equations inaccurate.
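
The numerical route mentioned above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes a hold-mode cell with two identical inverters built from matched nMOS and pMOS devices (same n, VT, IS), for which balancing the first-order sub-threshold currents gives a closed-form VTC. It then searches for the largest square in the upper-left lobe of the butterfly curve by sliding a 45° diagonal across the lobe. The values of `n` and `v_th` and the function names are assumptions.

```python
import math


def inverter_vtc(v_in, v_dd, n=1.5, v_th=0.0257):
    """Closed-form sub-threshold VTC for matched nMOS/pMOS (assumed devices).

    With matched devices the current ratio is G = exp((2*v_in - v_dd)/(n*v_th)),
    and x = exp((v_out - v_dd)/v_th) solves x^2 + (G - 1)x - G*exp(-v_dd/v_th) = 0.
    """
    g = math.exp((2.0 * v_in - v_dd) / (n * v_th))
    e = math.exp(-v_dd / v_th)
    b = g - 1.0
    disc = math.sqrt(b * b + 4.0 * g * e)
    # Pick the algebraically equivalent root form that avoids cancellation.
    x = 2.0 * g * e / (b + disc) if b > 0 else (disc - b) / 2.0
    return v_dd + v_th * math.log(x)


def bisect_root(func, lo, hi, iters=60):
    """Root of a decreasing function with func(lo) > 0 > func(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def hold_snm(v_dd, steps=400, **device):
    """Side of the largest square inside one lobe of the butterfly curve."""
    f = lambda v: inverter_vtc(v, v_dd, **device)
    best = 0.0
    for k in range(1, steps):
        c = v_dd * k / steps                  # diagonal line y = x + c
        g1 = lambda x: f(x) - x - c           # crossing with y = f(x)
        g2 = lambda y: f(y) - y + c           # crossing with x = f(y)
        if g1(0.0) <= 0 or g1(v_dd) >= 0 or g2(0.0) <= 0 or g2(v_dd) >= 0:
            continue                          # diagonal misses the lobe
        x1 = bisect_root(g1, 0.0, v_dd)
        x2 = bisect_root(g2, 0.0, v_dd) - c
        best = max(best, x1 - x2)             # horizontal extent = square side
    return best
```

Calling `hold_snm(0.3)` and `hold_snm(0.2)` under these assumptions gives values below the ideal VDD/2 bound, and the same search works with any monotonically decreasing VTC, for example one tabulated from simulation.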

B. Sub-Threshold SNM Dependencies

With embedded SRAM often providing multiple megabits
of storage, the SNM of the nominal bitcell becomes largely
irrelevant. Variations in processing and in the chip's environment
create a distribution of SNM across the bitcells in a given
memory, and the worst-case tail of this distribution determines
the yield. This section examines the impact of different parameters
on SNM in sub-threshold and offers a model for estimating
the tail of the SNM density function for process variation.

SNM for a bitcell with ideal VTCs is still limited to VDD/2
because of the two sides of the butterfly curve. An upper limit
on the change in SNM with VDD is thus 1/2. Fig. 7 shows example
butterfly curves at different supply voltages from 1.2 V
to 200 mV for both hold and read. Fig. 8 plots SNM versus VDD
directly for both hold and read mode. The slopes of the curves
confirm that less than 1/2 of VDD noise will translate into SNM
changes.

The impact of temperature on SNM in sub-threshold is
also not large. Fig. 9 shows SNM versus temperature in
sub-threshold and again for strong inversion. The sensitivity in
sub-threshold is lower, and varying temperature from -40 °C
to 125 °C only alters Read and Hold SNM by 21 mV and 6 mV,
respectively. Higher temperatures lower SNM in sub-threshold
due to the degraded gain in the inverters that results from
worse sub-threshold slope (see Fig. 6(b)). Also, pMOS devices
weaken relative to nMOS at higher temperature. Fig. 10 provides
example butterfly plots for 0 °C and 100 °C at 1.2 V and
0.3 V.



Fig. 8. SNM versus VDD.


Fig. 11. Cell ratio affects SNM less in sub-threshold.

Fig. 9. SNM versus temperature.


Fig. 10. VTCs during a read access across temperature.


In contrast to above-threshold [11], Fig. 11 shows that
cell ratio ((W/L)1/(W/L)2 or (W/L)4/(W/L)5) has very
little impact on SNM during sub-threshold read. In fact,
sub-threshold SNM sensitivity to any sizing changes is reduced.
The lower impact of sizing is intuitively reasonable
considering the exponential dependence of sub-threshold current
on other parameters. Mathematically, we can see from
(2)-(5) that sizing changes affect IS linearly and only have
a logarithmic impact on the VTCs. One point of caution here
is that VT for deep-submicron devices tends to vary with size
as a result of narrow or short channel effects. The impact of
this VT change that might accompany a sizing change is more
pronounced. These effects depend on the technology and make
general SNM modeling more complicated.

Fig. 12. Dependence of SNM high on single FETs is nearly linear.
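
To make the "logarithmic impact" concrete: under the first-order model, scaling a transistor's width by a factor k scales IS by k, which is equivalent to shifting its gate voltage by n Vth ln k. The short sketch below computes that equivalent shift. The values of `n` and `v_th` are illustrative assumptions, not process data.

```python
import math


def equivalent_vgs_shift(width_ratio, n=1.5, v_th=0.0257):
    """Gate-voltage shift (in volts) equivalent to scaling IS by width_ratio.

    Follows from I ~ IS * exp(VGS / (n * Vth)): multiplying IS by k has the
    same effect on current as adding n * Vth * ln(k) to VGS.
    n and v_th are assumed, illustrative values.
    """
    return n * v_th * math.log(width_ratio)
```

Under these assumptions, doubling a device width moves the VTC by only about 27 mV, whereas a VT change of any size shifts the effective gate drive one-for-one, which is why VT variation outweighs sizing in sub-threshold.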

C.  Dependence  on  Random  Doping  Variation 

The randomness of the number of doping atoms and their
placement in a MOSFET channel causes random mismatch even
in transistors with identical layout [12]. The impact on threshold
voltage, whose σ is proportional to (WL)^(-1/2), is the worst
for minimum sized devices which are common in SRAM. Local
variation is a huge problem for SRAM functionality, and it is the
subject of many papers (e.g., [13], [14]). The exponential dependence
of current on VT in sub-threshold operation makes this
random variation even more influential. Furthermore, the large
number of bitcells in many SRAMs makes the tails (5σ-6σ)
of the PDF more critical for modeling since the extreme cases
are the limiting factor for yield. Previous work has shown that
above-threshold SNM is nearly linear with VT, and modeling its
slope as constant allows an approximation of the joint PDF for
SNM [9]. Likewise, the sensitivity of above-threshold SNM to
VT is linearized for each transistor in [15].

Fig. 13. Dependence of SNM high on a single FET depends on other VTs in
(a) sub-threshold, unlike for (b) above-threshold.


Fig. 15. Scatter plots for SNM high versus SNM low with single FET dependencies
overlaid in white.


Fig. 14. SNM high and low (not shown) for (a) a minimum sized cell and for
(b) 4 × W/L is normally distributed with random VT mismatch in all transistors.


the Cumulative Distribution Function (CDF) for X_i, the PDF of
the minimum of two iid variables is given in (6):

$$f_{\min(X_1,X_2)} = 2 f_X \left(1 - F_X\right). \qquad (6)$$

Although SNM high and SNM low are normally distributed
with approximately the same mean and variance, we have previously
shown that they are not independent. However, we are
less interested in modeling the entire PDF for SNM than we are
in modeling the worst-case tail. As previously stated, the tail toward
lower SNM is the limiting factor. Let us assume that they
are iid. Then we can solve for the PDF as


Fig. 12 shows that, as in strong inversion, the sensitivity of
SNM high (the upper-left box in Fig. 4) is nearly linear with
each individual VT. However, Fig. 13(a) shows the relationship
between SNM and VT4 for a few different random values of the
other VTs. The obvious dependence of the slope on the other
VTs prevents using a model of the form
$SNM = SNM_0 + \sum_i c_i V_{Ti}$ for sub-threshold SNM. The same is not true of above-threshold,
shown in Fig. 13(b), for which a first order series
model works well [9], [15].

Fig. 14 shows the results of 5 k-point Monte Carlo (M-C)
simulations with random independent VT mismatch in all transistors.
These histograms confirm that sub-threshold SNM at
the upper lobe of the butterfly curve (SNM high) is normally
distributed. The solid lines show a fitted Gaussian PDF, and
the markers show simulation results. Larger sizes for the bitcell
clearly have the advertised effect of lowering the variance of VT
as seen in Fig. 14(b). The SNM low PDFs are very similar. The
scatter plot in Fig. 15 shows that SNM high and SNM low are
correlated. The dependencies for mismatch in each single transistor
are overlaid in white for reference. The Hold SNM shows
a saturation effect along the upper edge. SNM high and SNM
low are not independent because any change to a VTC that increases
the SNM at one side tends to decrease SNM at the other
side.

The actual SNM that matters for a bitcell is the minimum of
SNM high and SNM low. Thus, the random variable X_SNM =
min(X_SNMhigh, X_SNMlow). Order statistics can provide us with
the PDF for the minimum of n independent, identically distributed
(iid) random variables, X_i. If f is the PDF, and F is


$$f_{SNM} = 2 f_{SNMhigh}\left(1 - F_{SNMhigh}\right) \qquad (7)$$

and the CDF is simply

$$F_{SNM} = 2F_{SNMhigh} - \left(F_{SNMhigh}\right)^2. \qquad (8)$$
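
The minimum-of-two model can be checked numerically. The sketch below treats SNM high and SNM low as iid normal variables, which is the same simplifying assumption made in the text, and compares the modeled CDF 2F - F^2 against a small Monte Carlo experiment. The mean and standard deviation used in the usage note are hypothetical placeholders, and the function names are not from the source.

```python
import random
from statistics import NormalDist


def min_of_two_pdf(dist, x):
    """PDF of min(X1, X2) for iid X1, X2 with distribution `dist`: 2 f (1 - F)."""
    return 2.0 * dist.pdf(x) * (1.0 - dist.cdf(x))


def min_of_two_cdf(dist, x):
    """CDF of min(X1, X2) for iid X1, X2 with distribution `dist`: 2F - F^2."""
    c = dist.cdf(x)
    return 2.0 * c - c * c


def empirical_min_cdf(dist, x, samples=20000, seed=1):
    """Monte Carlo estimate of P(min(X1, X2) <= x) using independent draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        a = rng.gauss(dist.mean, dist.stdev)
        b = rng.gauss(dist.mean, dist.stdev)
        if min(a, b) <= x:
            hits += 1
    return hits / samples
```

For example, with a hypothetical `NormalDist(0.10, 0.02)` for SNM high, `min_of_two_cdf` evaluated at the mean returns 0.75, reflecting that the minimum of two samples falls below the common mean three times out of four. This is the downward shift in the mean described in the text.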

Fig. 16 shows the histogram for a 5 k-point M-C simulation
of Read SNM plotted on linear axes (a) and semilog axes (b).
Clearly, SNM is not normally distributed, and its mean is lower
than the mean of SNM high and SNM low. Fig. 16(b) shows
that a Gaussian PDF does not match the worst-case tail on the
left side of the PDF. On the other hand, the PDF based on (7)
provides a good estimate of the worst-case tail. The plot shows
that the model does not fit the distribution above the mean. This
shortcoming results from the correlation between SNM high and
SNM low. Since these two random variables are not iid, we
cannot claim that the minimum model will always match the tail.
However, we can show experimentally that it does offer a good
estimate. Thus, the model is a useful tool for evaluating SNM
under different design decisions and conditions. This PDF gives
the powerful option of estimating the SNM at the worst-case end
of the PDF without using extremely long M-C simulations until
the design space is narrowed sufficiently.

Fig. 16. (a) Histogram of SNM Monte Carlo simulation (circles) with normal
PDF (dash) and PDF based on (7) (solid) overlaid. The semilog plot (b) shows
that the PDF based on (7) matches the worst-case tail quite well.


Fig. 18. Monte Carlo simulation showing global variation impact on SNM for
a minimum sized bitcell.


Fig. 19. SNM Monte Carlo simulations for local mismatch on top of global
variation.


Fig. 17. 50k-point Monte Carlo simulation for SNM with 4 × W/L sized transistors.
Model based on 1 k-point Monte Carlo data matches the 50 k-point model
with <3% error.


Fig. 17 shows several estimated PDFs using (7) that are based
on data sets of different lengths. These estimates are plotted
over a 50 k-point M-C simulation. A 1000-point M-C simulation
gives a modeled distribution that overlays the modeled distribution
from the 50 k-point case on the plot (<3% error). Using
this approach allows a designer to reliably estimate the tail of
the SNM PDF for a large memory with relatively few samples.
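
The few-samples idea can be sketched as follows, under stated assumptions: fit a normal distribution to a short Monte Carlo run of SNM high, then invert the modeled CDF F_SNM = 2F - F^2 to find the SNM value at a target failure probability. The sample generator `draw_snm_high` is a hypothetical stand-in for a circuit simulator, and its mean and sigma are placeholders, not measured data.

```python
import math
import random
from statistics import NormalDist


def snm_tail_quantile(snm_high_samples, fail_prob):
    """SNM value whose modeled probability of not being exceeded is fail_prob.

    Inverts F_SNM = 2F - F^2 = 1 - (1 - F)^2, giving F = 1 - sqrt(1 - fail_prob),
    then maps F through the inverse CDF of a normal fitted to the samples.
    """
    fit = NormalDist.from_samples(snm_high_samples)
    f_high = 1.0 - math.sqrt(1.0 - fail_prob)
    return fit.inv_cdf(f_high)


def draw_snm_high(count, seed, mean=0.12, sigma=0.015):
    """Hypothetical stand-in for Monte Carlo circuit simulation of SNM high."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sigma) for _ in range(count)]
```

Comparing `snm_tail_quantile(draw_snm_high(1000, 1), 1e-6)` with the same call on 50,000 samples shows how closely a short run predicts the tail under these assumptions. The agreement depends on how well the fitted mean and sigma have converged, so this remains an estimate rather than a guarantee.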

Thus far we have assumed that device mismatch occurs in
transistors that start off as typical for the process. In addition to
the intra-die VT mismatch that we have described, there is an inter-die
process variation that sets the process corner (e.g., fast nMOS,
slow pMOS, etc.). Even for no mismatch, the process corner
impacts the SNM. Fig. 18 shows the SNM PDF for a minimum
sized 6T bitcell from a M-C simulation of global process corner
in which nine process parameters are varied. Here again, the tail
of the PDF is the limiting factor.

In a production framework, each die containing a given
SRAM will have a global process corner that affects SNM as in
Fig. 18. On top of this, mismatch in each cell will result from
random doping variation. Assuming that any die within 3σ of
the mean is usable, we found the global process corner that
gives an SNM yield with the same probability as -3σ for both
hold and read cases. Fig. 19 shows that the impact of mismatch
at this 3σ process corner is essentially to shift the mean of
the PDF by the offset caused by global variation. This means
that the models we have presented remain valid for the case of
combined global and local variation. Fig. 20 shows the semilog
plot of the distributions to confirm this conclusion.


Fig. 20. SNM Monte Carlo simulations for local mismatch on top of global
variation compared to the model. (Horizontal axis: Read SNM (V). Legend: model
and Monte Carlo, each with no global variation and with 3σ global variation.)


III. Conclusion

Static noise margin is a critical metric for SRAM bitcell stability.
This paper has explored the impact of different parameters
on SNM for SRAM bitcells in sub-threshold. The dominant
factor affecting sub-threshold circuits in general and SNM
specifically is VT mismatch due to random doping variation,
and the critical region for examination is the tail of the SNM
PDF. We have shown that first-order theoretical models for calculating
SNM are accurate close to the nominal values of VT,
but they cannot accurately account for all of the mismatch cases.
We have shown that SNM high and SNM low are normally distributed
with VT mismatch and correlated. Despite their correlation,
we have shown that treating them as iid leads to a PDF
for SNM that gives an accurate model of the tail cases. This estimate
is invaluable for avoiding long M-C simulations in the
design of large SRAMs for sub-threshold operation.


A 256-kb 65-nm Sub-threshold SRAM Design for
Ultra-Low-Voltage Operation

Benton Highsmith Calhoun, Member, IEEE, and Anantha P. Chandrakasan, Fellow, IEEE


Abstract - Low-voltage operation for memories is attractive
because of lower leakage power and active energy, but the challenges
of SRAM design tend to increase at lower voltage. This
paper explores the limits of low-voltage operation for traditional
six-transistor (6T) SRAM and proposes an alternative bitcell that
functions to much lower voltages. Measurements confirm that a
256-kb 65-nm SRAM test chip using the proposed bitcell operates
into sub-threshold to below 400 mV. At this low voltage, the
memory offers substantial power and energy savings at the cost of
speed, making it well-suited to energy-constrained applications.
The paper provides measured data and analysis on the limiting
effects for voltage scaling for the test chip.

Index Terms - Low-voltage memory, sub-threshold SRAM,
voltage scaling.

I. Introduction

SUBTHRESHOLD digital circuit design has emerged as a
low-energy solution for applications with strict energy constraints.
Analysis of sub-threshold designs has focused on logic
circuits (e.g., [1]). SRAMs comprise a significant percentage of
the total area and total power for many digital chips [2]. SRAM
leakage can dominate total chip leakage, and switching highly
capacitive bitlines and wordlines is costly in terms of energy.
Lowering VDD for SRAM saves leakage power and access energy.
Also, for system integration, SRAM must become capable
of operating at sub-threshold voltages that are compatible with
sub-threshold combinational logic. Overcoming the difficulties
of operating an SRAM in sub-threshold requires both circuit and
architectural innovations. The benefits are significant, however,
since low-energy SRAM is essential for enabling ultra-low-energy
systems. This paper describes an SRAM capable of operating
in the sub-threshold region.

Previous low-power memories show a trend of lower voltage
operation. Exploiting dynamic voltage scaling (DVS) for
SRAM is one motivation for designing a voltage-scalable
memory. A 0.18-µm 32-kB four-way associative cache offers
DVS compatibility from 120 MHz, 1.7 mW at 0.65 V to
1.04 GHz, 530 mW at 2 V [3]. Although DVS can provide
power reduction for active memories, most previous approaches
apply voltage scaling primarily to idle blocks by lowering VDD
(e.g., [2], [4]-[6]), raising ground (e.g., [7]-[10]), or both (e.g.,
[11]). Implementations of SRAM using lower VDD in standby
are available [5] along with software policies to determine
when to enter the lower leakage mode [2]. Voltage scaling for
SRAM promises to continue, leading to sub-threshold storage
modes and even sub-threshold operation for SRAMs operating
in tandem with sub-threshold logic.

Fig. 1. SNM for write access versus temperature and process corner (TT, WW,
SS, WS, and SW) at VDD = 0.3 V (a) and VDD = 0.6 V (b). Negative SNM
indicates successful write.

One issue for deeply voltage scaled SRAM is soft error rate
(SER). Soft errors occur when an alpha particle or cosmic ray
strikes a memory node and causes data loss. Since bitcell storage
capacitance decreases with scaling and voltage scaling further
reduces the stored charge, SER is a concern for sub-threshold
memory. Fortunately, there are methods for handling soft errors.
Studies of soft errors have shown that multi-cell errors from a
single strike only occur in two to three adjacent cells along a
wordline [12]. Thus, physically interspersing bits from different
words can prevent multi-errors from occurring in a single word
[12]. Coupling this with error correcting codes can dramatically
reduce SER [8].

Clearly, previous efforts have explored many options for
voltage scaling. However, none have yet pushed voltage scaling
into the sub-threshold region during active operation.
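
The bit-interspersing idea described above for soft errors can be illustrated with a small sketch. It assumes a simple column-interleaved layout in which bit b of word w sits at physical column b * interleave + w, and models a strike as upsetting a run of adjacent columns. The layout and function names are illustrative assumptions, not the organization of the test chip.

```python
def physical_column(word_index, bit_index, interleave):
    """Column of a bit when `interleave` words share one physical row (assumed layout)."""
    return bit_index * interleave + word_index


def upsets_per_word(first_column, span, interleave):
    """Count upset bits per word for a strike covering `span` adjacent columns."""
    hits = {}
    for col in range(first_column, first_column + span):
        word = col % interleave  # inverse of the physical_column mapping
        hits[word] = hits.get(word, 0) + 1
    return hits
```

With an interleave factor of at least three, a strike spanning three adjacent cells touches each word at most once, so a single-error-correcting code per word can repair it. With no interleaving, the same strike lands three errors in one word.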

II. Six-Transistor SRAM Bitcell in Sub-threshold

Predictions in [13] suggest that process variations will limit
standard 90-nm SRAMs to around 0.7 V operation due to degraded
Read Static Noise Margin (SNM) and reduced write
margin. Small transistors combine with random and systematic
process variations to cause a large spread in Read SNM that
leads to destructive read errors for bits at the tail of the distribution.
Standard write operation depends on a ratio of currents,
and process variations make this ratio difficult to maintain as
VDD decreases, leading to write errors. These practical problems
limit traditional six-transistor (6T) bitcells and architectures to
higher-VDD, above-threshold operation. Reports in the literature
of 65-nm SRAMs confirm this voltage barrier. A 65-nm
SRAM built in a dynamic-double-gate SOI (D2G-SOI) process
functions to 0.7 V and is predicted to fail below 1.0 V for bulk
CMOS [14]. A bulk CMOS 65-nm SRAM reports a minimum
operating voltage of 0.7 V [9]. Our results confirm that SNM
degradation and inability to write are the two primary obstacles
to sub-threshold SRAM functionality, where they are exacerbated
by the exponential impact of VT variations.

A. Write Operation

Proper write operation depends on sizing the access nMOS
to win the ratioed fight with the pMOS inside the bitcell to
write a "0". For a successful write, the bitcell becomes
monostable, forcing the internal voltages to the correct values. If the
cell retains bistability then the write does not occur, and the
SNM is positive on the cell's butterfly plot. Thus, a negative
SNM indicates a successful write (monostability in the cell).
For above-VT operation, stronger nMOS devices (due to mobility)
and relatively low dependence of current on VT make device
sizing successful at maintaining the proper ratio of currents
for writing the cell. For sub-threshold, the ratio of currents in
p/nMOS depends exponentially on VT. Since process designers
generally focus on strong-inversion operation, the sub-threshold
pMOS and nMOS current can be imbalanced for typical transistors.
Even if the pMOS and nMOS currents are well-balanced
at the typical nMOS, typical pMOS (TT) corner, process variation
can still create a relative difference in p/nMOS current of an
order of magnitude or more. Furthermore, local variations in VT
from cell to cell can aggravate this problem. For sub-VT, sizing
alone is not a strong knob for fixing this problem because only
unreasonable sizing ratios could account for the wide ranges of
possible current that arise due to VT mismatch.
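
The exponential sensitivity described above can be illustrated with the textbook sub-threshold current model, I = I0 * W * exp((VGS - VT) / (n * kT/q)). The sketch below is only an illustration: the slope factor n = 1.5, the thermal voltage of about 26 mV, and the 90 mV VT shift are assumed values, not parameters of the 65-nm process discussed here.

```python
import math

THERMAL_V = 0.0259   # assumed thermal voltage kT/q near room temperature (V)
N_SLOPE = 1.5        # assumed sub-threshold slope factor

def subvt_current(width, vgs, vt, i0=1.0):
    """Textbook sub-threshold drain current in arbitrary units (VDS >> kT/q)."""
    return i0 * width * math.exp((vgs - vt) / (N_SLOPE * THERMAL_V))

def width_ratio_to_compensate(delta_vt):
    """Width multiplier needed to cancel a VT increase of delta_vt volts."""
    return math.exp(delta_vt / (N_SLOPE * THERMAL_V))

vdd = 0.3
nominal = subvt_current(1.0, vdd, vt=0.40)
shifted = subvt_current(1.0, vdd, vt=0.49)   # hypothetical 90 mV VT mismatch
print(f"current drops by {nominal / shifted:.1f}x")
print(f"width needed to compensate: {width_ratio_to_compensate(0.09):.1f}x")
```

Under these assumptions a VT shift of well under 100 mV already costs about an order of magnitude in current, which is why sizing alone is a weak knob in sub-VT.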

In the 65-nm process for which we are designing, iso-size
pMOS devices are stronger in sub-VT than nMOS by roughly
an order of magnitude, which makes write functionality more
challenging. Fig. 1 shows the write margin (negative SNM means
successful write) of a 6 T bitcell versus temperature and process
corner. At VDD = 300 mV in Fig. 1(a), the writing fails for large
regions of process corner and temperature. The general trend
showing an improvement of write operation (i.e., more negative
margin) at higher temperature occurs because the pMOS
transistors weaken relative to nMOS as temperature rises. As VDD
increases, the write margin improves. Fig. 1(b) shows the write
margin at 0.6 V. This voltage is above VT, so the pMOS has
weakened relative to the nMOS because the mobility dominates
the differences in VT. Even at 0.6 V, the write margin is barely
negative for the worst-case corner, and this plot does not
account for local VT variation. For these reasons, VDD = 0.6 V is
the best case voltage for which we can expect traditional write
operations to work for a sub-threshold memory in this 65-nm
process.

B. Read Operation: Static Noise Margin

Fig. 2 shows a conceptual setup for modeling SNM [15].
Noise sources having value VN are introduced at each of the
internal nodes in the bitcell. As VN increases, the stability of
the cell reduces. Once VN exceeds the SNM, the cell loses
its bistability and its data. Cell stability during active
operation represents a more significant limitation to SRAM operation
than during hold. At the onset of a read access, the wordline is
"1" and the bitlines are precharged to "1". The internal node
of the bitcell that represents a zero gets pulled upward through
the access transistor due to the voltage dividing effect across
the access transistor (M2, M5) and drive transistor (M1, M4),
which degrades the Read SNM. Fig. 2 shows example butterfly
curves during hold and read that illustrate the degradation in
SNM during read.
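
The noise-source view of SNM can be sketched numerically. The code below is a simplified illustration, not the simulation setup used for the figures: it assumes an idealized, symmetric tanh-shaped inverter transfer curve with a hypothetical gain of 10, places equal adverse noise sources VN in series with both inverter inputs, and searches for the largest VN at which the cross-coupled pair still has three operating points (two stable states plus the metastable point).

```python
import numpy as np

def inverter_vtc(v_in, vdd, gain=10.0):
    """Idealized inverter transfer curve with slope -gain at VDD/2 (assumption)."""
    return 0.5 * vdd * (1.0 - np.tanh(2.0 * gain * (v_in - 0.5 * vdd) / vdd))

def is_bistable(v_noise, vdd, points=4001):
    """True if the cross-coupled pair still has three operating points."""
    x = np.linspace(0.0, vdd, points)
    # Loop map with adverse noise sources in series with both inverter inputs.
    loop = inverter_vtc(inverter_vtc(x - v_noise, vdd) + v_noise, vdd)
    positive = (loop - x) >= 0.0
    crossings = np.count_nonzero(positive[1:] != positive[:-1])
    return crossings >= 3

def static_noise_margin(vdd, tol=1e-5):
    """Largest noise voltage for which the cell stays bistable (bisection)."""
    lo, hi = 0.0, vdd
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_bistable(mid, vdd):
            lo = mid
        else:
            hi = mid
    return lo

for vdd in (0.3, 0.6):
    print(f"VDD = {vdd:.1f} V -> SNM ~ {1e3 * static_noise_margin(vdd):.0f} mV")
```

Because the assumed transfer curve scales with VDD, the SNM in this toy model scales linearly with supply voltage; a real bitcell during read would use a degraded, asymmetric transfer curve instead.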

Process variation makes matters worse by shifting the voltage
transfer characteristics (VTCs) of the cell inverters and creating
a distribution of SNM for both hold and read. A study of the
impact of variations on SNM in sub-VT appears in [16]. Fig. 3
shows the distribution of the Read and Hold SNMs for a 6 T
bitcell at a 300-mV supply voltage. The mean Read SNM is only
slightly above half of the mean Hold SNM, and the deviation of
the Read SNM is larger than for the Hold SNM. For a multiple
megabit memory, numerous cells will have Read SNM less than
zero based on this statistical analysis. From this figure, the mean
of the Read SNM at 500 mV roughly equals the mean of the
Hold SNM at 300 mV. However, it is unclear from this plot how
the Hold SNM and Read SNM compare at the worst-case tails.
Fig. 4 shows the cumulative distribution functions (CDFs)
derived from the distributions. For the same probability, the Hold SNM
for a given VDD roughly equals the Read SNM for twice that
VDD in the range of interest. This means that a memory that
avoids the Read SNM problem can operate at roughly half of
the VDD of a 6 T memory with the same 6σ bitcell stability.

[Fig. 3 legend: Read (VDD = 0.3 V), Read (VDD = 0.5 V), Read (VDD = 0.6 V), Hold (VDD = 0.3 V); x-axis: SNM (mV)]

Fig. 3. Distribution of Hold SNM at 300 mV compared with Read SNM
distributions at different voltages. Read SNM at 500 mV has the same
mean, but it has a larger standard deviation.

Fig. 4. CDFs of SNM distributions (x-axis: SNM (V)) showing that avoiding
the Read SNM allows a reduction in VDD by roughly half for the same 6σ
stability.
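
The statistical argument can be made concrete with a short sketch. The Gaussian means and standard deviations below are hypothetical placeholders chosen only to mimic the qualitative picture (Read SNM mean slightly above half the Hold SNM mean, with a larger spread); they are not values read from Fig. 3.

```python
from statistics import NormalDist

# Hypothetical Gaussian fits (mean, sigma in mV); not taken from Fig. 3.
hold_300mv = NormalDist(mu=100.0, sigma=10.0)
read_300mv = NormalDist(mu=55.0, sigma=12.0)

def expected_failing_cells(snm_dist, n_cells):
    """Expected number of cells whose SNM falls below zero."""
    return snm_dist.cdf(0.0) * n_cells

def six_sigma_snm(snm_dist):
    """SNM of a cell sitting six standard deviations below the mean."""
    return snm_dist.mean - 6.0 * snm_dist.stdev

n_cells = 4 * 1024 * 1024   # a 4-Mb array as an example of a multi-megabit memory
for name, dist in (("hold", hold_300mv), ("read", read_300mv)):
    print(name,
          f"expected failing cells: {expected_failing_cells(dist, n_cells):.2e}",
          f"6-sigma SNM: {six_sigma_snm(dist):.0f} mV")
```

With these placeholder numbers the hold distribution keeps a positive 6σ margin while the read distribution does not, which is the situation the text describes for a 6 T cell at 300 mV.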

III. A SUB-THRESHOLD BITCELL DESIGN

Previously published works have scaled SRAM into the
sub-threshold region during idle, but no SRAM actually
operates in this region. The 0.18-µm memory in [1] provides one
exception, operating into deep sub-VT at 180 mV. However, the
memory resembles a register file (latch with tristate driver for
writing and muxed outputs) and has an equivalent bitcell size of
18 T. We can use this previous implementation [1] as an
endpoint in the range of bitcell options that spans from the 6 T
bitcell (inoperable below 600-700 mV in 65 nm) to the 18 T
bitcell, which will function robustly in sub-threshold since it looks
and functions more like combinational logic. In between these
two options are many possible bitcell designs that address the
obstacles to sub-threshold operation by increasing the number
of transistors relative to the 6 T cell. The bitcell that this
section describes [17] was selected from among many others
because it represents the best trade-off of functionality and area; it
is the smallest bitcell from those examined that provides robust
sub-threshold functionality.

Fig. 5 shows the schematic of the 10 T sub-threshold bitcell.
Transistors M1 through M6 are identical to a 6 T bitcell except
that the source of M3 and M6 tie to a virtual supply voltage
rail, VVDD. Write access to the bitcell occurs through the write
access transistors, M2 and M5, from the write bitlines, BL and
BLB. Transistors M7 through M10 implement a buffer used for
reading. Read access is single-ended and occurs on a separate
bitline, RBL, which is precharged to VDD prior to read access.
The wordline for read also is distinct from the write wordline.
One key advantage to separating the read and write wordlines
and bitlines is that a memory using this bitcell can have distinct
read and write ports. Since a 6 T bitcell does not have this
feature, the 10 T bitcell is in some ways more fairly compared to an
8 T dual-port bitcell (6 T bitcell with two pairs of access
transistors and bitlines).

Fig. 6. Schematic of read buffer from 10 T bitcell for both data values. In both
cases, leakage is reduced to the bitline and through the inverter relative to the
case where M10 is excluded.


A. Enabling Sub-threshold Read

The 10 T bitcell in Fig. 5 uses transistors M7-M10 to remove
the problem of Read SNM by buffering the stored data during a
read access. As described previously, eliminating the Read SNM
problem allows this bitcell to operate at half of the VDD of a 6 T
cell while retaining the same 6σ stability. A different approach
for eliminating the Read SNM in [18] uses a 7 T cell to
prevent the higher voltage at the internal node from propagating to
the other back-to-back inverter by holding its data dynamically
during read accesses. This approach will not work in sub-VT
because the dynamic data is susceptible to leaking away during
the long access times.

It is interesting to note that a 9 T bitcell, identical to the
bitcell in Fig. 5 but without M10, would eliminate the Read SNM
problem while using less area than the 10 T cell. However, M10
is valuable to the bitcell because it reduces leakage current and
allows more bitcells to share a bitline. Fig. 6 shows the read
buffer from the 10 T bitcell for Q = 0 (a) and Q = 1 (b).
When Q = 0 and QB = 1, Fig. 6(a), M10 adds an off device in
series with the leakage path through M8 and the path through
M9, decreasing the leakage through those transistors.
Furthermore, since the pMOS in this 65-nm technology generally has
higher leakage than the nMOS, the leakage in M9 holds node
QBB near VDD (see Fig. 7), further limiting the leakage through
M8 by making its VGS negative. Even if QBB floats above 0 by
only a small amount, the negative VGS in M8 reduces bitline
leakage exponentially. When Q = 1 and QB = 0, Fig. 6(b),
M10 reduces leakage through M7 by the stack effect (note that
the stack of devices will also slow down a read access by
decreasing read current). Since node QBB is held solidly at VDD,
M8 has VDS = 0, so bitline leakage is negligible. In both cases,
M10 reduces the leakage relative to the 9 T (and 6 T) case. The
10 T only has 16% more leakage than a 6 T cell at the same VDD
(9 T has 50% more). This overhead in leakage current is more
than compensated by decreasing VDD by several hundred
millivolts relative to the 6 T bitcell. In simulation, the 10 T bitcell at
300 mV consumes 2.25× less leakage power than the 6 T bitcell
at 0.6 V [17].

Fig. 7. Simulation of voltage at node QBB in unaccessed 10 T bitcells versus
temperature and process corner. Strong pMOS leakage holds QBB near VDD,
except at the SW corner. Even at SW, QBB is higher than it is for the 6 T cell,
lowering bitline leakage.

Fig. 8. Simulation showing steady-state bitline voltages. The 10 T bitcell
exhibits much better steady-state bitline separation than the 6 T cell. The WW
corner is shown at 300 mV.
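
The two leakage-reduction mechanisms (negative VGS and zero VDS on the read access device) can be illustrated with the standard sub-threshold current expression, I = Ioff * exp(VGS / (n * kT/q)) * (1 - exp(-VDS / (kT/q))). This is a rough sketch under assumed parameters (n = 1.5, kT/q of about 26 mV, and a hypothetical 50 mV rise on node QBB); it ignores DIBL and gate leakage.

```python
import math

THERMAL_V = 0.0259   # assumed thermal voltage kT/q (V)
N_SLOPE = 1.5        # assumed sub-threshold slope factor

def subvt_leakage(vgs, vds, i_off=1.0):
    """Sub-threshold current relative to i_off (the current at VGS = 0, large VDS)."""
    gate_term = math.exp(vgs / (N_SLOPE * THERMAL_V))
    drain_term = 1.0 - math.exp(-vds / THERMAL_V)
    return i_off * gate_term * drain_term

vdd = 0.3
# Reference: access device with its source at ground (VGS = 0, VDS = VDD).
baseline = subvt_leakage(vgs=0.0, vds=vdd)
# Q = 0: QBB floats up by a hypothetical 50 mV, so the access device sees VGS = -50 mV.
q0_leak = subvt_leakage(vgs=-0.05, vds=vdd - 0.05)
# Q = 1: QBB sits at VDD, so the access device has VDS = 0.
q1_leak = subvt_leakage(vgs=-vdd, vds=0.0)

print(f"Q=0 leakage relative to baseline: {q0_leak / baseline:.2f}")
print(f"Q=1 leakage relative to baseline: {q1_leak / baseline:.2f}")
```

Even a few tens of millivolts on QBB cut the modeled bitline leakage several-fold, and the VDS = 0 case removes it entirely in this idealized model.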

The reduction in sub-threshold leakage through M8 reduces
the impact of leakage from unaccessed cells and gives the
additional advantage of allowing more cells on a bitline during
read. Leakage from the bitline into the unaccessed bitcells
causes undesirable voltage drop that slows differential sensing
and that makes single-ended read values difficult to distinguish.
Fig. 8 shows the impact of bitline leakage on steady-state
voltages (note that the bitline initially is precharged to VDD) while
reading a "0" (solid lines) or "1" (dotted lines) at 300 mV. For
the same number of cells on a BL, the 10 T bitcell (circles)
shows larger bitline separation than the 6 T (or 9 T) bitcells
(squares). This figure suggests that "sensing" with an inverter
(whose switching threshold, VM, is shown) should work well
from 0 °C to 100 °C even with 256 cells on a bitline for the 10 T
cell. In contrast, the 6 T cell (or 9 T bitcell) would allow at most
16 bitcells on a bitline. The bitline that should be "1" stays very
close to VDD at high temperatures and then begins to droop at
lower temperatures. This occurs because M10 inside the
unaccessed 10 T bitcells is so successful at reducing sub-VT current
through the access transistors that the sub-VT current actually
drops below the sum of gate currents (which is fairly constant
with temperature) into unaccessed cells. If gate leakage were
lower (e.g., with high-K dielectrics), then sub-threshold leakage into
the unaccessed cells would be reduced sufficiently that the bitline
would stay very close to VDD. One advantage of more cells on
a BL is a reduction in peripheral circuits that offsets some of
the area overhead of larger bitcells. For example, an 8 T bitcell
(40% larger than 6 T) that allows 256 cells per bitline rather
than 16 (same improvement as our cell) actually resulted in a
6% smaller overall array area [10].

Fig. 9. Schematic of write architecture for a single row using a floating power
supply (VVDD). The row is "folded" in layout so that its cells share n-wells,
and the entire row is written at once.

[Fig. 10 timing diagram signals: write, WL_WR, VDDon, Q and QB, VVDD (floating during write)]

Fig. 10. Timing diagram for write operation. When VDDon goes low while
WL_WR remains asserted, the cell's feedback restores full voltage levels for
the new values of Q and QB [point (a)].
B. Enabling Sub-threshold Write

In this 65-nm technology, a 6 T bitcell cannot write in the
traditional fashion below around 0.6 V because the nMOS access
transistor cannot reliably win the ratioed fight against the pMOS
to write a "0". The technique of weakening the cross-coupled
inverters by gating their supply voltage (e.g., [6]) or ground node
(e.g., [20]), applied by previous works primarily to improve
speed, can dramatically improve write margin. Fig. 9 shows
the schematic for a single row using this approach. A single
power-supply-gating header switch connects node VVDD to the
true power rail. When the bitcell holds its data or during read
accesses, VDDon = 0 so that VVDD = VDD. During a write
access, the virtual rail floats. For the implementation on the test
chip, a conceptual row folds as shown in the figure so that its
bitcells can share n-wells, and the entire row is written at once.

Fig. 10 shows the timing associated with a write access using
this scheme. First, the write signal goes high to indicate that a
write access will occur, and the bitlines are driven with the new
data. Next, the decoders drive a global wordline (not shown)
which causes the proper local write wordline (WL_WR) to go
high. Triggered by the local wordline, the VDDon signal goes
high, allowing node VVDD to float. As the write access
transistors discharge the virtual rail, its voltage droops, and Q and QB
change to their new values. The logical "1" inside the cell tracks
the drooping voltage until VDDon goes low again while the local
wordline remains high, and the virtual rail reconnects to VDD.
The feedback inside the bitcell then holds the Q and QB nodes
at their correct logical values and amplifies the "1" to full VDD
[point (a) in Fig. 10]. The plot in Fig. 11 shows the write margin
for the virtual VDD approach across temperature and process
corner at VDD = 300 mV. The margin remains negative across
all of these ranges, indicating a successful write.
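
The ordering of events in the floating-rail write can be summarized with a small behavioral sketch. It is not a circuit simulation: the class name, the trace strings, and the instantaneous bit flip are illustrative assumptions, and only the signal names (write, WL_WR, VDDon, VVDD) come from the timing description.

```python
class FloatingRailRow:
    """Behavioral sketch of one row that is written with a floating virtual rail."""

    def __init__(self, n_bits):
        self.q = [0] * n_bits
        self.vdd_on = 0   # 0: header on, VVDD tied to VDD; 1: header off, VVDD floats
        self.wl_wr = 0
        self.trace = []

    def write_row(self, data):
        if len(data) != len(self.q):
            raise ValueError("the entire row is written at once")
        self.trace.append("write=1, BL/BLB driven with new data")
        self.wl_wr = 1
        self.trace.append("WL_WR=1")
        self.vdd_on = 1
        self.trace.append("VDDon=1, VVDD floats and droops")
        self.q = list(data)   # cells flip while the weakened rail cannot fight back
        self.vdd_on = 0
        self.trace.append("VDDon=0, VVDD=VDD, feedback restores full levels")
        self.wl_wr = 0
        self.trace.append("WL_WR=0")

row = FloatingRailRow(4)
row.write_row([1, 0, 1, 1])
print("\n".join(row.trace))
```

The key ordering is that the rail reconnects while the write wordline is still asserted, so the cell feedback regenerates the data that the bitlines are still holding.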

IV. 65-nm SUB-THRESHOLD SRAM TEST CHIP

A. Test Chip Architecture

A 256-kb 65-nm bulk CMOS test chip uses the 10 T bitcell
and the architecture shown in Fig. 12. The memory has eight
32-kb blocks with 256 rows and 128 columns each. A single
128-bit DIO bus serves all eight blocks. In this initial
instantiation of the sub-threshold memory, only one read or write
can occur per cycle; however, the 10 T bitcell would allow a
read and write access to the same block in one cycle. Such a
dual-port instantiation of the memory would require a second
DIO bus and additional peripheral logic. A combined global
wordline and block select signal assert a local wordline that
triggers either the read wordline or the write wordline, WL_WR.
For a write access, MP for the accessed row turns off. The write
drivers consist simply of inverters with transmission gates, which
turn off when the memory is not writing to minimize leakage on
the write bitlines (BL and BLB). The power supply to the WL
drivers is routed separately to allow a boosted WL voltage. This
technique improves the access speed and increases the robustness
to local variations. The read bitline (RBL) is precharged prior to
read access, and its steady-state value is "sensed" using a simple
inverter, IRD. Column and row redundancy is a ubiquitous
technique in commercial memories used to improve yield. For
our analysis of the SRAM, we assume the availability of one
redundant row and column per block.

Fig. 12. Architecture diagram of the 256-kb memory on the test chip using
10 T sub-threshold bitcells.
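
The array organization can be checked with a few lines of arithmetic. The address split shown here (block index in the upper bits, row index in the lower bits) is a hypothetical mapping for illustration; the text does not specify how addresses are decoded.

```python
BLOCKS, ROWS, COLS = 8, 256, 128   # organization stated for the test chip

def split_row_address(addr):
    """Split a flat row address into (block, row); the bit assignment is assumed."""
    if not 0 <= addr < BLOCKS * ROWS:
        raise ValueError("row address out of range")
    return divmod(addr, ROWS)

bits_per_block = ROWS * COLS
total_bits = BLOCKS * bits_per_block
print(f"block size: {bits_per_block // 1024} kb, total: {total_bits // 1024} kb")
print("row address 257 ->", split_row_address(257))
```

Each access moves one full 128-bit row over the shared DIO bus, so a flat row address is all that is needed to select a word.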

The primary goals for this test chip were to test the
functionality of the 10 T bitcell in sub-threshold and to explore the
limitations of the design. For this reason, all of the peripherals use
static CMOS logic for simplicity and for functional robustness
in sub-threshold. The large block size was intentionally
aggressive to expose limitations in the bitcell and architecture.
Integrating 256 bitcells on the bitline (as opposed to 16 for 6 T)
pushes the envelope for functionality. The 10 T bitcell layout
added 66% area overhead relative to our reference 6 T design,
but the overall area penalty will be less due to more bitcells
on a bitline, as described in Section III-A. Each row is folded
such that a pair of 64-bit physical rows sharing n-wells and a
VVDD rail makes up one conceptual 128-bit row (c.f. Fig. 9).
This folding increases the length of bitlines by roughly 2× and
decreases the length of wordlines by roughly (1/2)×. Notice that
this is not fundamentally necessary for the write approach to
work. The n-wells of two separate rows can be shared and the
VVDD for each row routed separately. The 10 T architecture
does not change the number of WLs or BLs or the number of
devices per line relative to the 8 T case, except that it has one fewer
read BL. The capacitance of the metal lines themselves will
increase somewhat due to the larger bitcell area. Fig. 13 shows
a layout shot and die photograph of the test chip (1.89 mm by
1.12 mm, pin-limited).

B.  Measurements 

Measurements of the SRAM test chip confirm that it is
functional over a range of voltages from 1.2 V down well into the
sub-threshold region. With the assumption of one redundant row
and column per block, read operation works without error to
320 mV and write operation works without error to 380 mV at
27 °C. We continued to push the supply voltage to even lower
values to examine the limits of the implementation. At the low
supply voltage of 300 mV, the memory continues to function,
but it exhibits bit errors in 1% of its bits that result from
sensitivities in the architecture to local device variation, as described
later.

Fig. 13. (a) Annotated layout and (b) die photograph of the 256-kb
sub-threshold SRAM in 65 nm. Die size is 1.89 mm by 1.12 mm.

The test chip successfully demonstrates a functional
sub-threshold memory that overcomes the problems it was
designed to face. First, the bitcell removes the Read SNM
problem. Measurements have confirmed that the memory
experiences zero destructive read errors at 300 mV. Simulations
show that a 6 T memory would experience a high rate of
destructive read errors at 300 mV due to degraded Read SNM.
Second, whereas a 6 T memory would fail to write below about
600 mV, this memory writes correctly at 350 mV at 85 °C.
Third, a 6 T memory would experience problems reading with
only 16 bitcells on a bitline. Measurements show that the 10 T
memory reads correctly even with 256 bitcells on the bitline
down to 320 mV. Finally, the memory shows good Hold SNM
performance. The first bits observed to fail to hold their data
occur at VDD < 200 mV, as seen in the distribution shown in
Fig. 14.

Fig. 15 shows the measured leakage power of the test chip at
two different temperatures and the expected savings from VDD
scaling. At 27 °C, the 10 T memory saves 2.5× and 3.8× in
leakage power by scaling from 0.6 V to 0.4 V and 0.3 V,
respectively, and over 60× when VDD scales from 1.2 V to 0.3 V. VDD
scaling also gives the expected savings in active energy per read
access, as shown in Fig. 16. Fig. 17 shows the measured
frequency of operation versus VDD (the 1.2-V speed of 200 MHz
is a simulation result, because the testing board did not support
high-speed testing). The maximum measured operating speed at
400 mV is 475 kHz.

Fig. 14. Measured distribution of minimum voltage at which bitcells hold both
"0" and "1".

Fig. 16. Measured active energy per read access.

Fig. 17. Measured frequency of operation versus VDD.
Pushing VDD even lower exposes the limitations for both read
and write operations as a small fraction of bits begins to fail.
These failures occur for the same set of bits in a repeatable
fashion. For certain bits at low voltage, a read access shows
that the bitcell holds a "0" when in fact it holds a "1". This
error is non-destructive, which we confirm by raising the supply
voltage and re-reading the cell and invariably reading the
correct value. Also, these bit errors tend to gather along a small
number of specific columns. The fact that this error exists in a
small fraction of cases indicates that it results from local
device variation, and we can isolate the problematic transistors.
The fact that the bits exhibiting problems cluster along specific
columns indicates that variation in the sensing inverter, IRD in
Fig. 12, has shifted its switching threshold, VM, towards VDD.
Now, specific bitcells along this column that have read access
transistors weakened by local variation cannot hold the read
bitline above VM of inverter IRD. Several experiments
confirm that this is the mechanism for read failures. First, we can
independently lower the supply voltage of the sense inverters,
IRD. This lowers the VM for the inverters, and the measured
bit error rate decreases. Secondly, we can increase temperature,
which provides the expected improvement in discerning a "1"
(c.f. Fig. 8). Finally, we can increase the voltage of the
wordline drivers, which pulls the transistors that were weakened by local
variation back toward the mean and rapidly decreases the read
bit errors. Fig. 18 shows the measured percentage of bit errors
during read access versus VDD. The error rate without the
wordline boosted by 100 mV also is shown. In sub-threshold, the
extra gate voltage on the read access transistors provides over
10× (due to the sub-threshold slope) of extra current drive. As
with above-threshold memories, the extra current speeds
operation. It also makes the design more robust to mismatch.
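
The "over 10×" figure follows directly from the sub-threshold slope: each S millivolts of extra gate drive multiplies the current by ten. The sketch below assumes a slope of 90 mV/decade as an example value; the actual slope of the process is not given in the text.

```python
def boost_current_gain(boost_mv, slope_mv_per_decade):
    """Current multiplication from extra gate voltage in sub-threshold."""
    return 10.0 ** (boost_mv / slope_mv_per_decade)

ASSUMED_SLOPE = 90.0   # mV/decade, illustrative value only
gain = boost_current_gain(100.0, ASSUMED_SLOPE)
print(f"100 mV boost at {ASSUMED_SLOPE:.0f} mV/dec -> {gain:.1f}x current")
```

Any slope better than 100 mV/decade yields more than a tenfold increase for a 100 mV boost, consistent with the statement above.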
By aggressively choosing a block having 256 rows on a single
bitline, we pushed the limits of read operation and exposed the
limits to scaling read accesses that result from local variation.
Also, using a simple inverter for sensing makes it harder to read
a "1" correctly. As the bitline separation plots have shown, VM
of the sensing inverter lies too close to the logical "1" value at
some corners and temperatures. Boosting the wordline voltage
offers one simple change that dramatically reduces the error rate
and allows this memory to read without error at 320 mV. A
better solution to improve the read reliability and robustness
to local device variation is to replace the inverter with a new
sensing scheme, for which many relevant sense amps are
available in the literature.

The limit to VDD scaling for write manifests when write
accesses fail for specific bitcells. Write functionality was tested
using a high-voltage write, a low-voltage write of the opposite
value, and finally a high-voltage read. This test isolates the bits
for which sub-threshold write fails. These errors aggregate in
bits along specific rows. As with the read errors, local device
variation is the culprit, and the predominance of row-wise
errors suggests that the failure mechanism involves the row
peripherals. Referring back to Fig. 12, write limitations first
appear along specific rows whose pull-up device, MP, is
strengthened by local variation. Thus, when MP turns off
during a write access, its larger leakage pulls VVDD up closer
to VDD. Some of the bitcells can still switch under these
conditions, but VVDD reaches a steady-state voltage that is
high enough to prevent some bitcells from overpowering the
pMOS to write a "0" into the memory. In these bitcells, local
mismatch has made the internal pMOS relatively stronger than
the access transistor to the point that the write driver cannot
flip the cell at the steady-state VVDD voltage. Measurements
confirm that this is the case. First, the lowest functional supply
voltage decreases at higher temperature. Since the leakage
through MP gets relatively weaker compared with the nMOS
access transistors as temperature rises, this confirms the
mechanism for failure. More importantly, the write errors decrease
when the supply voltage to the wordline increases. The higher
wordline voltage increases VGS for the write access transistors
and makes them more capable of producing voltage droop on
VVDD. Fig. 19 shows the percentage of bit errors measured during
write both with and without 100 mV of wordline boosting. With
boosting, the memory can write without error at 380 mV at 27 °C
and 350 mV at 85 °C.
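
The three-step write test (high-voltage write, low-voltage write of the opposite value, high-voltage read) can be sketched as a test loop. The `FakeSram` class is a hypothetical stand-in for the test chip in which every bit has its own minimum write voltage; the real measurement setup and its interface are not described in the text.

```python
class FakeSram:
    """Hypothetical memory model: each bit has its own minimum write voltage."""

    def __init__(self, min_write_v):
        self.min_write_v = list(min_write_v)
        self.bits = [0] * len(self.min_write_v)

    def write(self, addr, value, vdd):
        # The write only takes effect if the supply is high enough for this bit.
        if vdd >= self.min_write_v[addr]:
            self.bits[addr] = value

    def read(self, addr, vdd):
        # Reads are assumed reliable at the high test voltage.
        return self.bits[addr]

def find_write_failures(mem, n_bits, v_high, v_low):
    """Return addresses where a low-voltage write fails for either data value."""
    failing = set()
    for value in (0, 1):
        for addr in range(n_bits):
            mem.write(addr, 1 - value, v_high)   # known-good background value
            mem.write(addr, value, v_low)        # write under test
            if mem.read(addr, v_high) != value:  # read back at a safe voltage
                failing.add(addr)
    return sorted(failing)

mem = FakeSram([0.30, 0.45, 0.35, 0.50])
print(find_write_failures(mem, 4, v_high=1.2, v_low=0.38))
```

Reading back at the high voltage separates write failures from read failures, since the read itself is then known to be reliable.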

As with the limitations on read, simple changes to the
peripheral circuits can push the lowest operational VDD even lower.
Specifically, the leakage through MP can be reduced using one
of several well-known methods (e.g., stacking, RBB, etc.). A
better solution to the write issue that maintains the same basic
architecture and approach is to induce a specific voltage drop
on VVDD intentionally. In the extreme, replacing MP with an
inverter will drive VVDD all the way to 0 V. Then, as long as
the write wordline remains asserted, the bitcells will develop the
correct internal data when VVDD goes back high regardless of
local variations. A disadvantage of this extreme case is the
energy penalty associated with discharging and re-powering the
VVDD rail and all of the bitcells in the row. An alternative is
to use a circuit (e.g., diode-connected FET) to force VVDD to
some intermediate value that is low enough to ensure write but
that uses less energy.

V. SUMMARY AND CONCLUSIONS

Sub-threshold SRAM provides the dual advantages of
minimizing total memory energy consumption and of providing
compatibility with minimum-energy sub-threshold logic.
Traditional 6 T SRAM cannot function in sub-threshold because it
fails to write and because the Read SNM degrades badly.
Furthermore, bitline leakage in 6 T SRAMs limits the number of
bitcells on a bitline to 16. Measurements of a 256-kb 65-nm bulk
CMOS test chip show that our 10 T bitcell fundamentally solves
the Read SNM problem, overcomes the write problem, and
relaxes the bitline integration limitation to allow sub-threshold
operation. With one redundant row and column per block and a
boosted wordline, the memory functions without error to below
380 mV. At 400 mV, it consumes 3.28 µW and works up to 475
kHz. Although aggressive design exposes the limitations of the
architecture in terms of its robustness to local device variation,
the bit errors result primarily from problems in the peripheral
circuits. Simple proposed changes to the periphery promise to
push the limits of SRAM operation to even lower VDD.
Abstract: A 500-MS/s 5-bit ADC for UWB applications has been
fabricated in a 65-nm CMOS technology using no analog-specific
processing options. The time-interleaved successive approximation
register (SAR) architecture has been chosen due to its simplicity
versus flash and its amenability to scaled technologies versus
pipelined, which relies on operational amplifiers. Six
time-interleaved channels are used, sharing a single clock operating at
the composite sampling rate. Each channel has a split capacitor
array that reduces switching energy, increases speed, and has
similar INL and decreased DNL, as compared to a conventional
binary-weighted array. A variable delay line adjusts the instant of
latch strobing to reduce preamplifier currents. The ADC achieves
Nyquist performance, with an SNDR of 27.8 and 26.1 dB for
3.3 and 239 MHz inputs, respectively. The total active area is
0.9 mm2, and the ADC consumes 6 mW from a 1.2-V supply.

Index Terms: ADC, analog-to-digital conversion, deep-submicron
CMOS, successive approximation register, ultra-wideband
radio.


I. INTRODUCTION

ULTRA-WIDEBAND (UWB) radio is an emerging technology
for very-high-data-rate, short-distance wireless
communications. Both OFDM [1] and pulse-based [2] solutions
are being developed to achieve data rates in excess of
480 Mb/s. UWB receivers require high-speed but low-resolution
analog-to-digital converters (ADCs), in the range of
4-5 bits [3]-[5]. The ADC in this work is targeted for
specifications (5 bit, 500 MS/s) compatible with a custom pulse-based
UWB transceiver [6], [7], where 100 Mb/s communication
is achieved using BPSK-modulated 500-MHz-wide Gaussian
pulses transmitted in one of 14 bands between 3.1-10.6 GHz.
The flash topology, along with its interpolating and folding
variants, has been the conventional choice for high-speed,
low-resolution ADCs [8]-[12]. While flash can maintain the highest
throughput, it requires an exponential growth in the number of
comparisons with the resolution. The ensuing complexity
motivates the use of other architectures.

Pipelined ADCs are used for high-speed, medium-resolution
applications [13], [14]. They can provide one conversion per
clock period throughput and only a linear scaling in complexity
with resolution; however, they rely on operational amplifiers at
the heart of the multiplying digital-to-analog converter (MDAC)
in each pipelined stage. Because it must be closed-loop stable,
this amplifier typically uses one or two high-gain stages.
Unfortunately, in deep-submicron CMOS, the achievable gain per
stage is limited because short-channel effects lower $g_m r_o$ for
a single transistor, and reduced voltage supplies restrict circuit
techniques such as cascoding. Thus, there are significant challenges
for continued scaling of pipelined ADCs.

Very recently, for the high-speed, low-resolution converters
necessary for UWB, the time-interleaved successive approximation
register (SAR) architecture has re-emerged¹ as a low-power
alternative to flash and pipelined ADCs [17]. At the required
speeds, their major limitation is digital power; a SAR converter
includes digital feedback in the critical path. A full custom logic
controller with dynamic registers can reduce digital power
significantly, but it still remains a dominant source of power
consumption in a 0.18-µm CMOS implementation [18]. Another
approach uses dynamic registers with asynchronous operation
to reduce clock power, and combined with a non-binary
successive approximation algorithm, has led to a very energy-efficient
design in 0.13-µm CMOS [19]. Fortunately, technology
scaling improves the digital power and speed without many of
the issues plaguing pipelined converters. The only active analog
component in a SAR ADC, the comparator, still requires large
gain and bandwidth, but because it does not have to be linear,
this gain can be achieved through cascaded stages and positive
feedback.

This paper presents a 500-MS/s 5-bit ADC fabricated in a
65-nm CMOS technology [20]. At the maximum sampling rate,
the ADC consumes 6 mW from a 1.2-V supply. This low power
consumption is achieved through proper architecture selection,
a new capacitor array, and careful timing allocation between the
digital and analog circuits. The ADC has six time-interleaved
SAR channels synchronized to a common clock. The split
capacitor array reduces switching energy, is robust to digital delay
mismatches for overall improved settling time, and has a
reduction in peak static differential nonlinearity (DNL). In the
comparator, a variable delay line adjusts the instant of strobing
for the regenerative latches, minimizing idle time during each
bit-cycle without sacrificing bit error rate (BER) performance.

II. ADC Architecture

A SAR ADC requires one period for sampling and b periods
to resolve the b digital output bits. To make the internal SAR
clock synchronous to the overall sampling clock, six time-interleaved
channels are used, as shown in Fig. 1. Thus, only a single
500 MHz clock is required in the prototype, easing clock
generation and distribution. The channels synchronize by passing

¹Time-interleaved SAR was used as early as 1980 as a low-area alternative
to the flash ADC [15], and, more recently, for reduced comparator power in a
medium-resolution application [16].

Fig. 1. Top-level block diagram of the 6-way time-interleaved ADC.


Fig. 2. Block diagram of the channel, which has a capacitive DAC, comparator,
and digital logic.


a token to cue their start of sampling, and all critical sampling
edges are aligned to the same shared clock [18]. Timing skew
between channels is thus limited to routing variations to the
channels and the delay mismatch through a single register in
each channel; both of these error sources can be kept sufficiently
small such that digital timing correction (a complex,
power-hungry process [21]) is not necessary.
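The channel count follows from the timing described above: one sampling period plus b bit-cycles gives b + 1 = 6 phases for a 5-bit converter, so six channels keep one channel sampling on every clock edge. The sketch below is only an illustration of that schedule; the function name and the phase labels are assumptions, not part of the design.

```python
def interleave_schedule(bits=5, n_clocks=12):
    """Return, for each clock edge, the phase of every channel.

    Each channel cycles through bits + 1 phases: one sampling period
    followed by one period per output bit (MSB first).
    """
    n_channels = bits + 1  # one sampling period + b bit-cycles
    phases = ["sample"] + [f"bit{k}" for k in range(bits, 0, -1)]
    schedule = []
    for t in range(n_clocks):
        # Channel ch receives the token (starts sampling) when t = ch mod n_channels.
        row = [phases[(t - ch) % n_channels] for ch in range(n_channels)]
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for t, row in enumerate(interleave_schedule()):
        print(t, row)
```

Every row of the schedule contains exactly one sampling channel, which is why a single clock at the composite rate is sufficient.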

The channel, shown in Fig. 2, consists of a capacitive
digital-to-analog converter (DAC), a comparator, and control logic
(itself called the SAR). The control logic switches the DAC
using a binary search algorithm to minimize the error between
the digital output and the analog input. The split capacitor
array and comparator, the two analog blocks, are discussed in
Section III, followed by some of the considerations used in
designing circuits for 65-nm CMOS.
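The binary search performed by the control logic can be sketched behaviorally as follows. This is an idealized single-ended model, assuming a perfect DAC and comparator and a 400-mV reference; the function name and its default arguments are illustrative only.

```python
def sar_convert(v_in, v_ref=0.4, bits=5):
    """Ideal successive-approximation conversion of v_in in [0, v_ref)."""
    code = 0
    for n in range(bits - 1, -1, -1):      # MSB first
        trial = code | (1 << n)            # tentatively set bit n
        v_dac = v_ref * trial / (1 << bits)
        if v_in >= v_dac:                  # comparator decision
            code = trial                   # keep the bit
    return code


if __name__ == "__main__":
    print(sar_convert(0.26))
```

Each loop iteration corresponds to one bit-cycle, so a 5-bit conversion takes five comparator decisions after the sampling period.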

III. Circuit Design

A. Split Capacitor Array

The DAC serves two purposes in a SAR converter: it
samples the input charge, and it generates an error voltage
between the input and current digital estimate. The conventional
DAC choice is a binary-weighted capacitor array [22], as
shown in Fig. 3, which is insensitive to stray capacitance.
As shown in [23], however, the conventional capacitor array
uses charge inefficiently during a conversion. To demonstrate
this, a conversion of a 2-bit capacitor array is presented here.
During the first bit decision after sampling, the MSB capacitor
is connected to $V_{REF}$ with the remaining capacitors connected
to ground (left circuit in Fig. 4). The output of the capacitor
array, $V_X$, is

$$V_X = -V_{IN} + \frac{1}{2} V_{REF} \qquad (1)$$

where $V_{IN}$ is the input voltage sampled on the capacitor array
and $V_{REF}$ is the reference voltage. During the second bit-cycle,
the SAR does one of two transitions. If $V_X < 0$, an "up" transition
is performed, where $C_1$ is switched from ground up to
$V_{REF}$, drawing

$$E_{up} = \frac{1}{4} C_0 V_{REF}^2 \qquad (2)$$

from the reference voltage supply. Inversely, if $V_X > 0$, a
"down" transition is performed (Fig. 4); $C_1$ and $C_2$ switch
places. If they switch at the same time, the energy required is

$$E_{down,conv} = \frac{5}{4} C_0 V_{REF}^2. \qquad (3)$$

It takes 5 times more energy to lower $V_X$ than to raise it; this
occurs because all of the charge initially on $C_2$ is discharged to
ground, and all the charge that ends up on $C_1$ must be delivered
from the reference voltage supply.
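The factor of 5 between the two transitions can be checked with a small charge-conservation calculation. The sketch below is an assumption-laden illustration rather than the authors' analysis: it treats the top plate as floating, ignores parasitics, and expresses energy in units of $C_0 V_{REF}^2$. The helper name `reference_energy` and the list encoding of bottom-plate connections are hypothetical.

```python
def reference_energy(caps, before, after, v_ref=1.0):
    """Energy drawn from V_REF when bottom plates move from `before` to `after`.

    caps   : capacitor values in units of C0
    before : bottom-plate connection of each capacitor before the step
             (1 = V_REF, 0 = ground)
    after  : bottom-plate connection of each capacitor after the step
    """
    c_total = sum(caps)
    # Floating top plate: charge conservation gives the top-plate voltage step.
    dv_x = v_ref * sum(c * (a - b) for c, a, b in zip(caps, after, before)) / c_total
    # Only capacitors tied to V_REF after the step draw charge from the reference.
    charge = sum(
        c * ((a - b) * v_ref - dv_x)
        for c, a, b in zip(caps, after, before)
        if a == 1
    )
    return v_ref * charge


# 2-bit conventional array: [C0 (termination), C1, C2] = [1, 1, 2] unit capacitors.
caps = [1, 1, 2]
first_cycle = [0, 0, 1]                                   # only the MSB capacitor at V_REF
e_up = reference_energy(caps, first_cycle, [0, 1, 1])     # C1 goes up
e_down = reference_energy(caps, first_cycle, [0, 1, 0])   # C1 and C2 swap

if __name__ == "__main__":
    print(e_up, e_down)
```

Under these assumptions the "up" step draws one quarter of $C_0 V_{REF}^2$ and the "down" step draws five quarters, matching the ratio stated in the text.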

Ref. [23] analyzes three alternatives to the conventional capacitor
array and switching procedure. Of these alternatives, this
work implements the split capacitor array because it both has
the lowest switching energy and does not require an extra clock
phase that would limit high-speed operation. A b-bit split capacitor
array is shown in Fig. 5; the MSB capacitor of the conventional
array has been split into an identical copy (MSB subarray)
of the rest of the array (main subarray). These arrays are placed
in parallel (common top plate), not to be confused with the
series-connected capacitor arrays used in the sub-DAC approach.² The
total capacitance of the split capacitor array is $2^b C_0$, identical to
the conventional case, and the area requirements are unchanged.
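The structure just described can be written down as lists of unit-capacitor counts. This is a minimal sketch, assuming each subarray includes one unit termination capacitor as in the 2-bit example; the function names are illustrative.

```python
def conventional_array(bits):
    """Unit-capacitor counts [C_0, C_1, ..., C_b], with C_n = 2**(n-1) for n >= 1."""
    return [1] + [2 ** (n - 1) for n in range(1, bits + 1)]


def split_array(bits):
    """Return (main, msb) subarrays; each mirrors a (bits-1)-bit conventional array."""
    main = conventional_array(bits - 1)  # C_0, C_1, ..., C_{b-1}
    msb = conventional_array(bits - 1)   # C_{b,0}, C_{b,1}, ..., C_{b,b-1}
    return main, msb


if __name__ == "__main__":
    main, msb = split_array(5)
    print(main, msb, sum(main) + sum(msb), sum(conventional_array(5)))
```

For 5 bits both arrays total 32 unit capacitors, and the MSB subarray sums to the 16 units of the conventional MSB capacitor it replaces.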

The split capacitor switching algorithm is presented in Fig. 6.
Here, the two-bit example from above is repeated for the split
capacitor array to demonstrate the switching method and energy
savings. During the first bit-cycle (left side of Fig. 7), the
MSB subarray, $C_{2,1}$ and $C_{2,0}$, is connected to $V_{REF}$, and the
main subarray is connected to ground. Since $C_2 = C_{2,1} + C_{2,0}$,
(1) also represents the output of the split array. In the case of
an "up" transition, the array transitions in the same method as
above, with $C_1$ switching to $V_{REF}$, consuming the same energy
calculated in (2). In the "down" transition (Fig. 7), half of the
MSB subarray, $C_{2,1}$, is lowered to ground, leaving both $C_1$ and
$C_{2,0}$ unchanged. By only switching one capacitor, the energy
consumed is

$$E_{down,split} = \frac{1}{4} C_0 V_{REF}^2 \qquad (4)$$

identical to the "up" transition.

²Historically, the combination of capacitive main- and sub-DACs had been
called a "split array" [15], but this has not become common usage, and we have
co-opted the term for the new structure.

Fig. 3. Conventional b-bit binary-weighted capacitor array.

Fig. 4. "Down" transition of the conventional capacitor array.

The overall energy savings of the split capacitor array is input
voltage (or output digital code) dependent. Where the relative
frequency of "down" transitions is greater, the savings for the
split capacitor array is enhanced, as seen in Fig. 8. Assuming a
full-swing sinusoidal input distribution, the split capacitor array
is expected to have 37% lower switching energy than the
conventional array.

For this high-speed implementation, an additional advantage
of considerable significance is related to the array's settling
time. During a "down" transition, two capacitors are required
to switch for the conventional capacitor array; any mismatch,
whether random or deterministic, in the digital logic driving
these switches can cause the capacitor array to initially transition
in the wrong direction, potentially exacerbating an overdrive
condition for the preamplifiers. Only one capacitor in the
split capacitor array transitions during any bit-cycle, providing
inherent immunity to the skew of the switch signals. Simulation
results comparing the settling times of the two arrays are shown
in Fig. 9. For the simulation, the total width of the switches
is identical for the split and conventional arrays. The split
capacitor array settles up to 10% faster, which is used to reduce
the bias currents in the preamplifiers by a similar amount.

1) Linearity Performance: To compare the theoretical static
linearity of the binary-weighted and split DACs, each of the
capacitors is modeled as the sum of the nominal capacitance
value and some error term:

$$C_n = 2^{n-1} C_0 + \delta_n$$

$$C_{b,n} = 2^{n-1} C_0 + \delta_{b,n}. \qquad (5)$$

Initially, consider only the case where all the errors are in the
unit capacitors, whose values are independent identically-distributed
(i.i.d.) Gaussian random variables; later in this section,
other non-idealities will be considered. Then the error terms $\delta_n$
and $\delta_{b,n}$ have zero mean, are independent, and have variance

$$E[\delta_n^2] = E[\delta_{b,n}^2] = 2^{n-1} \sigma_0^2 \qquad (6)$$

where $\sigma_0$ is the standard deviation of the unit capacitor.

The linearity of a SAR ADC is limited by the accuracy of
the DAC outputs, which are calculated here for the case of no
initial charge on the array ($V_{IN} = 0$). For a given DAC digital
input $y = \sum_{n=1}^{b} S_n 2^{n-1}$, where $S_n$ equals 0 or 1 and represents the
ADC decision for bit n, the analog output for the conventional
binary-weighted array is

$$V_{X,conv}(y) = \frac{\sum_{n=1}^{b} (2^{n-1} C_0 + \delta_n) S_n}{2^b C_0 + \Delta C} V_{REF}. \qquad (7)$$

The second term in the denominator, $\Delta C = \sum_{n=0}^{b} \delta_n$, will be
neglected for this discussion. This will make the analysis simpler
but will prevent a complete closed-form solution for the integral
nonlinearity (INL). Subtracting the nominal value yields
the error term

$$V_{err}(y) \approx \frac{\sum_{n=1}^{b} \delta_n S_n}{2^b C_0} V_{REF} \qquad (8)$$

with variance

$$E[V_{err}^2(y)] = \frac{\sum_{n=1}^{b} 2^{n-1} \sigma_0^2 S_n}{2^{2b} C_0^2} V_{REF}^2 = \frac{y \sigma_0^2}{2^{2b} C_0^2} V_{REF}^2. \qquad (9)$$

This voltage error is simply the sum of the errors from y unit
capacitors connected to $V_{REF}$. Because the errors in the unit
capacitors are assumed to be i.i.d., it does not matter which unit
capacitors are connected to $V_{REF}$ but only the total number.
Thus, (9) holds for the case of the split capacitor array as well.
This error is also directly related to the INL of the ADC and
thus there should be no difference between the maximum INLs
of the two arrays.

Fig. 5. The b-bit split capacitor array, with the main subarray on top and the MSB subarray below.

Fig. 6. Switching procedure for split capacitor array. t represents the bit
currently being decided.

Fig. 7. "Down" transition of the split capacitor array. The "up" transition
entails switching $C_1$ to $V_{REF}$.

The DNL of the capacitive DAC is, neglecting gain errors, the
difference between the voltage errors at two consecutive DAC
outputs, as in

$$DNL(y) \approx \Delta V_{err}(y) = V_{err}(y) - V_{err}(y - 1). \qquad (10)$$

The worst-case DNL for the binary-weighted capacitor array is
expected to occur at the step below the MSB transition, where
its variance is

$$E[\Delta V_{err}^2(2^{b-1})] = \frac{(2^b - 1) \sigma_0^2}{2^{2b} C_0^2} V_{REF}^2 \approx \frac{\sigma_0^2}{2^b C_0^2} V_{REF}^2. \qquad (11)$$

Fig. 8. Normalized switching energies of the conventional and split capacitor
arrays versus output code. The number of "down" transitions is greater on the
left side of the plot.


Fig. 9. Simulation of the settling time of the split and conventional capacitor
arrays under the presence of digital timing skew.

For the split capacitor array, the worst-case DNL also occurs at
the step below the MSB transition, but its value is

$$\Delta V_{err}(2^{b-1}) = \frac{\sum_{n=0}^{b-1} \delta_{b,n} - \left( \sum_{n=0}^{b-2} \delta_{b,n} + \sum_{n=1}^{b-2} \delta_n \right)}{2^b C_0} V_{REF} = \frac{\delta_{b,b-1} - \sum_{n=1}^{b-2} \delta_n}{2^b C_0} V_{REF}. \qquad (12)$$

This error has a variance of

$$E[\Delta V_{err}^2(2^{b-1})] = \frac{(2^{b-1} - 1) \sigma_0^2}{2^{2b} C_0^2} V_{REF}^2 \approx \frac{\sigma_0^2}{2^{b+1} C_0^2} V_{REF}^2. \qquad (13)$$

Comparing (11) and (13) shows that the standard deviation of
the worst-case DNL is $\sqrt{2}$ lower for the split capacitor array.
Conceptually, this occurs because the errors at $y = 2^{b-1}$ and
$y = 2^{b-1} - 1$ are partially correlated for the split capacitor array,
causing the cancellation of $\delta_{b,0}, \ldots, \delta_{b,b-2}$ in (12). This can
also be seen in the energy example above. In Fig. 4, the errors
of the top capacitors are completely uncorrelated for the two
bit decisions; however, in Fig. 7, the error of $C_{2,0}$ contributes
equally to both bit decisions.

Fig. 10. Behavioral simulation comparing the linearity of the split and
conventional capacitor arrays. 10000 Monte Carlo runs were performed, with i.i.d.
Gaussian errors in the unit capacitors ($\sigma_0 / C_0 = 3\%$). The standard deviation
of the INL and DNL are plotted.

A behavioral simulation of the SAR ADC, with both the
binary-weighted and split capacitor arrays, was performed. The
values of the unit capacitors are taken to be Gaussian random
variables with standard deviation of 3% ($\sigma_0 / C_0 = 0.03$), and
the ADC is otherwise ideal. Fig. 10 shows the results of 10000
Monte Carlo runs, where the standard deviation of the INL and
DNL are plotted versus output code at the 5-bit level. As
expected, the conventional and split arrays have identical INL
characteristics, and the split capacitor array has better DNL.
This improvement in DNL is similar to that conferred at the
MSB transition from using 1 bit of unary decoding in a
segmented DAC [24].
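A simulation of this kind can be sketched in a few lines. The version below is not the authors' simulator: it is a simplified model that only tracks the capacitor error sums at the MSB transition, neglects the gain term, expresses errors in LSBs, and assumes the split array reaches each DAC level by the switching sequence described in the text ("up" raises a main-subarray capacitor, "down" lowers an MSB-subarray capacitor). All function names are illustrative.

```python
import random
import statistics


def cap_errors(sizes, sigma, rng):
    """Error of each capacitor: sum of `size` i.i.d. unit errors (units of C0)."""
    return [sum(rng.gauss(0.0, sigma) for _ in range(s)) for s in sizes]


def conventional_errors(bits, sigma, rng):
    """DAC error in LSBs for y = 0 .. 2**bits - 1, binary-weighted array."""
    sizes = [2 ** (n - 1) for n in range(1, bits + 1)]       # C_1 .. C_b
    delta = cap_errors(sizes, sigma, rng)
    return [sum(d for n, d in enumerate(delta) if (y >> n) & 1)
            for y in range(2 ** bits)]


def split_errors(bits, sigma, rng):
    """DAC error in LSBs for each level, split array reached by SAR switching."""
    sizes = [2 ** (n - 1) for n in range(1, bits)]           # sizes 1 .. 2**(bits-2)
    main = dict(zip(sizes, cap_errors(sizes, sigma, rng)))   # keyed by nominal size
    msb = dict(zip(sizes, cap_errors(sizes, sigma, rng)))
    msb_unit = rng.gauss(0.0, sigma)                         # C_{b,0}, stays at V_REF
    errors = [0.0] * (2 ** bits)
    for y in range(1, 2 ** bits):
        level, step = 2 ** (bits - 1), 2 ** (bits - 2)
        err = msb_unit + sum(msb.values())                   # whole MSB subarray at V_REF
        while level != y:
            if y > level:                                    # "up" transition
                err += main[step]
                level += step
            else:                                            # "down" transition
                err -= msb[step]
                level -= step
            step //= 2
        errors[y] = err
    return errors


def dnl_inl_std(errors_fn, bits=5, sigma=0.03, runs=10000, seed=0):
    """Standard deviation of DNL and INL (in LSBs) at the MSB transition."""
    rng = random.Random(seed)
    mid = 2 ** (bits - 1)
    dnl, inl = [], []
    for _ in range(runs):
        e = errors_fn(bits, sigma, rng)
        dnl.append(e[mid] - e[mid - 1])                      # step error, as in (10)
        inl.append(e[mid])
    return statistics.pstdev(dnl), statistics.pstdev(inl)


if __name__ == "__main__":
    print("conventional:", dnl_inl_std(conventional_errors))
    print("split:       ", dnl_inl_std(split_errors))
```

For b = 5 and a 3% unit-capacitor mismatch, the model predicts a DNL standard deviation near $\sqrt{31} \cdot 0.03 \approx 0.167$ LSB for the conventional array and near $\sqrt{15} \cdot 0.03 \approx 0.116$ LSB for the split array, with the same INL spread of about 0.12 LSB in both cases.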

The above discussion assumes that the errors in the unit
capacitors are due to an i.i.d. random process. In practice, care
must be taken during layout to ensure absence of systematic
non-idealities. The unit capacitors are arranged in a common
centroid configuration to eliminate the effect of first-order
gradients. Fringing effects at the edge of the array are reduced by
using 32 dummy capacitors around the 32 active unit capacitors.
The largest capacitors in the main subarray and MSB subarray
are distributed so as to have equal numbers of edges next to the
dummy capacitors to further reduce fringing errors. The split
capacitor array does have twice as many bottom-plate signals that
must be routed within the array. Coupling from these routes to
the top-plate routing can cause linearity errors and was avoided
by routing the top- and bottom-plate signals distant from each
other, which was sufficient at 5-bit resolution. For higher
resolutions, electrostatic shielding may be necessary where the bottom
plate routing is separated from the capacitors by grounded metal
[25]. Shielding can also improve immunity to noise coupling
from the substrate.

Fig. 11. Comparator schematic showing preamplifier chain, latch, and VDL inserted in series with the latch strobe signal.

B. Comparator With Adjustable Strobing

The comparator, shown in Fig. 11, has a regenerative latch
preceded by two stages of autozeroed preamplifiers, used to
reduce the input-referred offset of the latch to below one quarter of
the LSB voltage. The preamplifiers are linear amplifiers with an
input NFET differential pair M1-M2 and resistive loads, formed
by PFETs operating in the linear region. The gain per
stage is selected to be 3-4 for ease of integration at both low
voltages and with very short channel devices. The offset of the
first preamplifier is cancelled using output offset storage. The
sizing of the preamplifiers, autozeroing capacitors, and latch
follows the offset/matching-limited optimization procedure
described in [26].

During bit-cycling, the clock period is divided into one phase
for the settling of the DAC and preamplifiers and one phase for
regeneration of the latch. The latch typically resolves, even for
small inputs, in much less than the 1 ns that is allocated assuming
an even division of the period. The ADC sits idle after the latch
settles until the start of the next bit-cycle. Self-timed bit-cycling
uses this idle time to start the next bit-cycle early [18], [27]. This
approach relaxes the preamplifier settling time requirement for
all but the first bit-cycle (determining the MSB), as it has no
prior bit-cycle from which to borrow. Instead, here a variable
delay line (VDL) has been inserted in series with the latch strobe
signal to extend analog settling time in the first half of every
bit-cycle, including the first, "pre-borrowing" time from that
bit-cycle's own latch phase. The beginning of every bit period
is synchronous with the sampling clock, and the latch strobing
is determined by the setting of the VDL, which is tuned
externally to see tradeoffs between extended settling time and ADC
performance.
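The timing trade can be illustrated with simple arithmetic: at a 500-MHz clock each bit-cycle lasts 2 ns, split evenly into 1 ns of settling and 1 ns of regeneration unless the strobe is delayed. The sketch below assumes an ideal delay line; the function name and the 0.1-ns delay used in the example are illustrative values, not measured settings.

```python
def bit_cycle_budget(f_clk_hz=500e6, vdl_delay_s=0.0):
    """Split one bit-cycle into (settling time, regeneration time) in seconds."""
    period = 1.0 / f_clk_hz
    settle = period / 2 + vdl_delay_s   # DAC and preamplifier settling
    regen = period / 2 - vdl_delay_s    # latch regeneration
    if regen <= 0:
        raise ValueError("VDL delay leaves no time for latch regeneration")
    return settle, regen


if __name__ == "__main__":
    print(bit_cycle_budget())                      # even split of the period
    print(bit_cycle_budget(vdl_delay_s=0.1e-9))    # strobe delayed by 0.1 ns
```

Because the start of each bit-cycle stays locked to the sampling clock, any time given to settling is taken directly from the latch phase, which is why the delay must stay below the margin the latch actually needs.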

C.  Technology  Considerations 

The SAR architecture's digital complexity directly benefits
from the reduced feature sizes. Even though this ADC uses a
fully static CMOS logic style, it still consumes less power than
the highly customized logic, including dynamic registers, used
in [18]. Care was taken throughout the digital logic to provide
the maximum robustness in presence of delay variations.

The two analog blocks are well suited for integration in 65-nm
CMOS with the following design considerations. For the same
absolute device size, transistor matching improves in successive
technology generations, allowing smaller total device area and
capacitance in the comparators [28]; however, the matching is
not improved for minimum-size devices. Also, due to the
reduced power supplies and decreased $g_m r_o$ of the short-channel
devices, it is difficult to get high gain in a single analog stage.
The preamplifiers and latch use non-minimum-length transistors
to improve both the matching and output impedance. While this
does increase device capacitance for the same $g_m$, there is
minimal power impact because wiring parasitics dominate the total
capacitance in the comparator.

The capacitor array is entirely passive, and its switching
speed is improved with the shorter gate lengths. Because no
analog-specific processing steps (e.g., a thin oxide for
high-density MiM capacitors) were used in fabrication, the capacitors
are formed using interdigitated metal comb capacitors. The
capacitance is determined by fringing between adjacent metal
lines, structures that have been shown to achieve similar
densities to MiM capacitors with matching limits at greater than
the 7-bit level [29]. The capacitance size is chosen according
to the matching requirements discussed in Section III-A. The
input voltage is constrained to between 0 and 400 mV to allow
sampling with a single standard-$V_T$ NFET transistor without
exceeding the process voltage limit of 1.2 V.

IV. Measurements

The ADC has been fabricated in a 65-nm CMOS technology;
a die photograph is shown in Fig. 12. With a 91-kHz input
sampled at 500 MS/s, the INL and DNL are -0.16/0.15 and
-0.20/0.26 LSBs, respectively (Fig. 13). The split capacitor
array suffers no linearity degradation as compared to a separate
on-chip test channel with the conventional array. The split array
uses 31% less power from the 400 mV reference voltage supply;
the difference in energy savings from the theory presented above
is due to the increased bottom-plate routing.

The delay line was tested using an on-chip delay detection
circuit and varying the input differential voltage. Due to an
underestimation of parasitics in the delay line, only the first two
delay steps out of 16 provided sufficient time for latch
regeneration, and these extended the period available to the
preamplifiers by about 10%. At 250 MS/s, a 0.5-1 dB improvement in
SNDR was achieved by properly tuning the delay.

Fig. 12. Photograph of 1.9 x 1.4 mm die.

Fig. 13. Static linearity of ADC versus output code.

The dynamic performance of the ADC is shown in Fig. 14
with the input frequency swept from DC to beyond Nyquist.
The signal-to-noise-plus-distortion ratio (SNDR) does not drop
by 3 dB until past the Nyquist frequency. A fast Fourier
transform (FFT) of a 239.04-MHz input is shown in Fig. 15. Spurs
(a)-(d) result from gain errors and skew between channels, and
spurs (eMO are due to offset mismatch. All of these spurs are
below -39 dBFS, and their combined power is still less than
the total noise power (excluding the spurs) at this near-Nyquist
input. The gain mismatch between channels is 0.9%. The
individual channels have an effective number of bits (ENOB)
between 4.65 and 4.75 with low-frequency inputs, dropping by 0.4
bits at Nyquist.
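For reference, SNDR and ENOB are related by the standard full-scale sine-wave formula ENOB = (SNDR - 1.76 dB) / 6.02 dB. The short sketch below applies it to the two composite SNDR values reported for this ADC; the function name is illustrative.

```python
def enob_from_sndr(sndr_db):
    """Effective number of bits for a full-scale sine input, SNDR in dB."""
    return (sndr_db - 1.76) / 6.02


if __name__ == "__main__":
    for sndr in (27.8, 26.1):
        print(sndr, round(enob_from_sndr(sndr), 2))
```

With this formula, 27.8 dB corresponds to about 4.33 bits and 26.1 dB to about 4.04 bits for the interleaved converter.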

The ADC consumes 2.86 mW and 3.06 mW, respectively,
from 1.2-V analog and digital supplies at the maximum sampling
frequency. The ADC was also tested at lower sampling
frequencies. At 250 MS/s, the ADC consumes a total of 1.58 mW
from a 1 V digital and 0.8 V analog supply, while still
maintaining Nyquist performance. A summary of the ADC is listed
in Table I.

Fig. 14. Dynamic performance versus input frequency.

TABLE I
Summary of Performance

Technology               65-nm CMOS 1P6M
Supply Voltage           1.2 V
Sampling Rate            500 MS/s
Resolution               5 bit
Input Range              800 mVpp differential
SNDR (fin = 3.3 MHz)     27.8 dB
SNDR (fin = 239 MHz)     26.1 dB
SFDR (fin = 2.19 MHz)    36.0 dB
THD (fin = 239 MHz)      -41.5 dB
DNL (channel)            0.26 LSB
INL (channel)            0.16 LSB
Analog Power             2.86 mW
Digital Power            3.06 mW
Total Power              5.93 mW
Active Area              0.65 mm x 1.4 mm

TABLE II

Comparison of State-of-the-Art ADCs

Work       Architecture  Feature Size  Power (mW)  fs (MHz)  Resolution (bits)  fin (MHz)  ENOB  FOM (pJ/conv. step)
[17]       SAR           90 nm         10          600       6                  300        5.1   0.5
[19]       SAR           0.13 µm       5.3         600       6                  300        5.02  0.27
[30]       Subranging    0.13 µm       21          125       8                  62.5       7.5   0.96
[13]       Pipelined     0.18 µm       30          200       8                  99         7.68  0.74
[31]       Subranging    90 nm         55          1000      6                  500        5.3   1.37
[32]       Flash         90 nm         2.5         1250      4                  625        3.66  0.16
This work  SAR           65 nm         5.9         500       5                  239        4.04  0.75
This work  SAR           65 nm         1.8         250       5                  120        4.10  0.44
This work  SAR           65 nm         0.9         125       5                  60         3.95  0.51

V.  Comparison  and  Discussion 

To enable a comparison to other ADCs operating at different
speeds and resolutions, the figure of merit

$$FOM = \frac{P}{2^{ENOB} \cdot 2 \cdot f_{in}} \qquad (14)$$

is used [17], where P is the power consumption, and ENOB is
measured for input frequency $f_{in}$, not to exceed Nyquist input.
Table II compares state-of-the-art ADCs with sampling rates
in excess of 100 MS/s and resolutions of 8 bits or less. From
the results, this ADC has one of the best energy efficiencies of
published work. In addition, as three out of the four best designs
demonstrate, the time-interleaved SAR architecture can achieve
very low power for these specifications. This work requires no
linearity calibration or digital postprocessing of the samples.
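The figure of merit in (14) is easy to evaluate directly. The sketch below applies it to two of the operating points listed for this work in Table II (power, ENOB, and input frequency taken from the table); the function name is illustrative, and the result is converted to pJ per conversion step.

```python
def figure_of_merit(power_w, enob, f_in_hz):
    """Energy per conversion step in joules, following (14)."""
    return power_w / (2 ** enob * 2 * f_in_hz)


if __name__ == "__main__":
    points = [(5.9e-3, 4.04, 239e6), (1.8e-3, 4.10, 120e6)]
    for p, enob, f_in in points:
        print(round(figure_of_merit(p, enob, f_in) * 1e12, 2), "pJ/conv. step")
```

These inputs give approximately 0.75 and 0.44 pJ per conversion step, in agreement with the corresponding rows of Table II.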


VI.  Conclusion 

An ADC targeted for UWB specifications has been presented.
The time-interleaved SAR architecture provides superior
energy efficiency to a flash converter because of its linear growth
in complexity with the resolution. Two new techniques have
enabled high-speed, low-power SAR operation. The split
capacitor array offers both lower switching energy and
improved settling speed as compared to the conventional array.
Joint timing design of the analog and digital portions of the
chip, as demonstrated with the adjustable latch strobing instant,
can ease settling time requirements and use otherwise wasted
idle time during bit-cycling. State-of-the-art energy efficiency
and performance have been demonstrated with robust operation
in deep-submicron CMOS.